[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173941#comment-13173941 ]

Dawid Weiss commented on LUCENE-3654:
-------------------------------------

There's been an interesting discussion about the common use of Unsafe on the hotspot mailing list recently (can't recall the thread now, though). Some people even wanted Unsafe to become part of the standard library (not the unsafe accesses -- the lock checking part, but nonetheless). This guy wrote an entire off-heap collections library on top of Unsafe: http://www.ohloh.net/p/java-huge-collections

I think using Unsafe with a fallback is fine, especially in small-scope methods that are used frequently and can be thoroughly tested. BytesRef is such an example to me. That said, it would certainly help to convince Robert and others if you ran benchmarks with and without Unsafe and showed how much there is to gain, Shay.

> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
> Inspired by Google Guava's UnsignedBytes lexicographical comparator, which uses Unsafe to do long-based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in the BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
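[Editor's note] The long-at-a-time trick Dawid refers to can be sketched without Unsafe at all; this is a hypothetical illustration (class name invented, not Lucene or Guava code), using the fact that for big-endian packing, unsigned long order equals lexicographic unsigned byte order:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LongwiseCompareSketch {
    // Lexicographic unsigned byte comparison, 8 bytes per step where possible.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);
        int i = 0;
        // wide stride: compare 8 bytes at a time as unsigned big-endian longs
        for (; i + 8 <= n; i += 8) {
            long la = ba.getLong(i), lb = bb.getLong(i);
            if (la != lb) {
                return Long.compareUnsigned(la, lb) < 0 ? -1 : 1;
            }
        }
        // tail: remaining bytes one at a time, unsigned
        for (; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

Guava's variant reads native-order longs via Unsafe and locates the differing byte with bit tricks, which is faster still; the sketch above only shows why comparing packed longs preserves byte order.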
[jira] [Commented] (LUCENE-3662) extend LevenshteinAutomata to support transpositions as primitive edits
[ https://issues.apache.org/jira/browse/LUCENE-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173942#comment-13173942 ]

Dawid Weiss commented on LUCENE-3662:
-------------------------------------

Avanti Robert! :)

> extend LevenshteinAutomata to support transpositions as primitive edits
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-3662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3662
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-3662.patch, LUCENE-3662.patch, LUCENE-3662_upgrade_moman.patch, lev1.rev115.txt, lev1.rev119.txt, lev1t.txt, update-moman.patch
>
> This would be a nice improvement for spell correction: currently a transposition counts as 2 edits, which means users of DirectSpellChecker must use larger values of n (e.g. 2 instead of 1) and larger priority queue sizes, plus some sort of re-ranking with another distance measure for good results. Instead, if we can integrate chapter 7 of http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 then you can just build an alternative DFA where a transposition is only a single edit (http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance). According to the benchmarks in the original paper, the performance of LevT looks to be very similar to Lev. Support for this is now in moman (https://bitbucket.org/jpbarrette/moman/) thanks to Jean-Philippe Barrette-LaPierre.
[JENKINS] Lucene-Solr-tests-only-trunk - Build # 11853 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/11853/

1 tests failed.

REGRESSION:  org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock

Error Message:
null

Stack Trace:
junit.framework.AssertionFailedError:
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
	at org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock(TestIndexWriter.java:1270)
	at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528)

Build Log (for compile errors):
[...truncated 1335 lines...]
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173954#comment-13173954 ]

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

I agree here, but before doing this, I want some non-micro-benchmarks to show the effect. If there is no real effect, don't do it. Inside Lucene the comparator is not used very often (mostly only in the indexer/BytesRefHash and in TermRangeQuery). The other use cases are asserts all over the place, but they don't count.

I would agree to the patch if the class were renamed to something like UnsignedBytesComparator and the part importing sun.misc.Unsafe were moved outside the main compilation unit. That way, if somebody compiles with a strange JVM like Harmony (although it's dead) where sun.misc.Unsafe is not available, the build still succeeds. The code in BytesRef uses reflection to load the comparator implementation, so all is fine: it would just get a ClassNotFoundException and fall back to the Java one. I could help with the ANT magic.
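[Editor's note] The reflection-plus-fallback loading strategy Uwe describes can be sketched roughly like this (class and package names are hypothetical, not the real Lucene code): attempt to load the optional Unsafe-backed implementation by name, and fall back to a plain Java comparator if the class cannot be loaded.

```java
import java.util.Comparator;

public class ComparatorLoader {
    // name of the optional implementation; hypothetical, for illustration only
    static final String OPTIONAL_IMPL = "org.example.UnsignedBytesComparator";

    @SuppressWarnings("unchecked")
    public static Comparator<byte[]> load() {
        try {
            Class<?> clazz = Class.forName(OPTIONAL_IMPL);
            return (Comparator<byte[]>) clazz.getConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError e) {
            // class missing or unloadable (e.g. no sun.misc.Unsafe on this JVM):
            // fall back to the safe, pure-Java implementation
            return ComparatorLoader::compareUnsigned;
        }
    }

    // plain Java unsigned, lexicographic byte comparison (the fallback)
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

The point of the design is that the build and the runtime never hard-depend on sun.misc.Unsafe: only the optional class references it, and that class is loaded reflectively.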
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173959#comment-13173959 ]

Uwe Schindler commented on LUCENE-3631:
---------------------------------------

Hi, I committed some small cleanups and dead code removal after Clover analysis this morning.

One thing: we have thread locals for TermVectorsReader and StoredFieldsReader. Would it make sense to use one for DocValues, too? What do you think, Simon?

> Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3631
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3631
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-3631.patch, LUCENE-3631.patch
>
> After LUCENE-3606 is finished, there are some TODOs: SegmentReader still contains (package-private) all delete logic including crazy copyOnWrite for validDocs Bits. It would be good if SegmentReader itself could be read-only like all other IndexReaders. There are two possibilities to do this:
> # The simple one: subclass SegmentReader and make a RWSegmentReader that is only used by IndexWriter/BufferedDeletes/... DirectoryReader would only use the read-only SegmentReader. This would move all TODOs to a separate class. Its reopen/clone method would always create a RO-SegmentReader (for NRT).
> # Remove all write and commit stuff from SegmentReader completely and move it to IndexWriter's readerPool (it must be in readerPool, as deletions need a non-changing view on an index snapshot).
> Unfortunately the code is so complicated, and I have no real experience in those internals of IndexWriter, so I did not want to do it with LUCENE-3606; I just separated the code in SegmentReader and marked it with TODO. Maybe Mike McCandless can help :-)
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173962#comment-13173962 ]

Robert Muir commented on LUCENE-3654:
-------------------------------------

The reason I am -1: I don't want JVM crashes. This is Lucene Java; users can expect not to have JVM crashes because of BytesRef bugs in Lucene (this class is used all over the place). They should get AIOOBE and NPE and other things. So all is not fine just because it has a fallback.

Convincing me that there is a performance win is a waste of time; this method is not a hotspot. Convincing me that nobody will get JVM crashes is going to be difficult.
[jira] [Commented] (LUCENE-3662) extend LevenshteinAutomata to support transpositions as primitive edits
[ https://issues.apache.org/jira/browse/LUCENE-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173965#comment-13173965 ]

Uwe Schindler commented on LUCENE-3662:
---------------------------------------

How many beers did you need for that?
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173971#comment-13173971 ]

Robert Muir commented on LUCENE-3654:
-------------------------------------

Here's an example. Since so much of the Lucene codebase has bugs with BytesRef offsets, I figure it's a good example:

{noformat}
public void testOops() {
  BytesRef b = new BytesRef("abcdefghijklmnop");
  b.offset = -545454544; // some bug, integer overflows and goes negative, or other problem
  System.out.println(b.compareTo(new BytesRef("abcdefghijklmnop")));
}
{noformat}

With this patch, this gives me a SIGSEGV:

{noformat}
junit-sequential:
    [junit] Testsuite: org.apache.lucene.util.TestBytesRef
    [junit] #
    [junit] # A fatal error has been detected by the Java Runtime Environment:
    [junit] #
    [junit] #  SIGSEGV (0xb) at pc=0x7f386e7dcf64, pid=6093, tid=139880338200320
    [junit] #
    [junit] # JRE version: 6.0_24-b07
    [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops)
    [junit] # Problematic frame:
    [junit] # V  [libjvm.so+0x76ef64]
    [junit] #
{noformat}
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173972#comment-13173972 ]

Simon Willnauer commented on LUCENE-3654:
-----------------------------------------

Can we have some -DXX:LuceneUseUnsafe option to enable this? I mean, there are two camps here, and that could make everybody happy. If you use this option, you have to expect possible problems, no?
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ]

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare(): check that offset >= 0 and offset + length <= bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in its Buffer methods.
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173975#comment-13173975 ]

Simon Willnauer commented on LUCENE-3631:
-----------------------------------------

bq. One thing: we have thread locals for TermVectorsReader and StoredFieldsReader. Would it make sense to use one for DocValues, too? What do you think Simon?

I don't see a need for this. The source is cached in the DocValues instance, and DocValues instances can be shared across threads.
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173976#comment-13173976 ]

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

bq. can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?

We could put the whole comparator into contrib, and BytesRef could have a static setter to change the default impl. Or we use SPI for it (contrib exports it in META-INF) :-)
[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173977#comment-13173977 ]

Robert Muir commented on LUCENE-3654:
-------------------------------------

Sorry, I'm totally against the change, even with safety checks. I think this will hurt the reputation of the project, and I think it will be a nightmare for developers too (sorry, I don't want to debug avoidable JVM crashes). And I don't want to see Lucene start using Unsafe everywhere. This is lucene-java; things like bounds checking are part of the language.
[JENKINS] Solr-trunk - Build # 1711 - Still Failing
Build: https://builds.apache.org/job/Solr-trunk/1711/

No tests ran.

Build Log (for compile errors):
[...truncated 37341 lines...]
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173983#comment-13173983 ]

Uwe Schindler commented on LUCENE-3631:
---------------------------------------

bq. The source is cached in the DocValues instance and DocValues instances can be shared across thread.

Thanks, I just wanted to make sure that there is no synchronization on DocValues. A customer of mine saw huge improvements in loading stored fields since this went into Lucene.
[jira] [Issue Comment Edited] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
[ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ]

Uwe Schindler edited comment on LUCENE-3654 at 12/21/11 10:09 AM:
------------------------------------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare(): check that offset >= 0 and offset + length <= bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in its Buffer methods.

*EDIT* You also have to copy offset, length, and the actual byte[] reference to local variables at the beginning, before the bounds checks (because otherwise another thread could change the *public* non-final fields in BytesRef and cause a SIGSEGV). BytesRef is a user-visible class, so it must be 100% safe against all usage violations. Given this additional overhead, the whole comparator makes no sense except for terms larger than about 200 bytes, and Lucene terms are shorter in 99% of all cases. If you want to use this comparator, just subclass Lucene40Codec and return it as the term comparator; this can live completely outside Lucene. You can even use Guava.
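[Editor's note] The snapshot-then-validate pattern Uwe describes can be sketched as follows. All names are hypothetical (a stand-in struct, not Lucene's real BytesRef): copy the mutable public fields to locals once, bounds-check the locals, and only then access the array, so a racing writer cannot invalidate the checks between validation and access.

```java
public class SafeCompareSketch {
    // stand-in for BytesRef: public, non-final, mutable fields
    public static class BytesRefLike {
        public byte[] bytes;
        public int offset, length;
        public BytesRefLike(byte[] b, int off, int len) { bytes = b; offset = off; length = len; }
    }

    public static int compare(BytesRefLike a, BytesRefLike b) {
        // 1) snapshot the mutable fields exactly once
        final byte[] aB = a.bytes; final int aOff = a.offset, aLen = a.length;
        final byte[] bB = b.bytes; final int bOff = b.offset, bLen = b.length;
        // 2) validate the snapshot, not the (possibly still-changing) fields
        checkBounds(aB, aOff, aLen);
        checkBounds(bB, bOff, bLen);
        // 3) now the access is provably in bounds: unsigned, lexicographic compare
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int cmp = (aB[aOff + i] & 0xFF) - (bB[bOff + i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return aLen - bLen;
    }

    static void checkBounds(byte[] bytes, int off, int len) {
        // written as "off > bytes.length - len" to avoid int overflow in off + len
        if (off < 0 || len < 0 || off > bytes.length - len) {
            throw new IndexOutOfBoundsException("offset=" + off + " length=" + len);
        }
    }
}
```

With this shape, the corrupted-offset example from earlier in the thread fails with IndexOutOfBoundsException instead of a JVM crash.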
[jira] [Created] (SOLR-2983) Unable to load custom MergePolicy
Unable to load custom MergePolicy - Key: SOLR-2983 URL: https://issues.apache.org/jira/browse/SOLR-2983 Project: Solr Issue Type: Bug Reporter: Mathias Herberts As part of a recent upgrade to Solr 3.5.0 we encountered an error related to our use of LinkedIn's ZoieMergePolicy. It seems the code that loads a custom MergePolicy was at some point moved into SolrIndexConfig.java from SolrIndexWriter.java, but as this code was copied verbatim it now contains a bug:

try {
  policy = (MergePolicy) schema.getResourceLoader().newInstance(mpClassName, null, new Class[]{IndexWriter.class}, new Object[]{this});
} catch (Exception e) {
  policy = (MergePolicy) schema.getResourceLoader().newInstance(mpClassName);
}

'this' is no longer an IndexWriter but a SolrIndexConfig, therefore the call to newInstance will always throw an exception and the catch clause will be executed. If the custom MergePolicy does not have a default constructor (which is the case for ZoieMergePolicy), the second attempt to create the MergePolicy will also fail and Solr won't start.
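The failure mode is easy to reproduce with plain reflection. In this sketch, Writer, Config, and CustomPolicy are hypothetical stand-ins for IndexWriter, SolrIndexConfig, and ZoieMergePolicy; passing the wrong `this` makes the first attempt throw, and the missing default constructor makes the fallback throw too:

```java
import java.lang.reflect.Constructor;

final class PolicyLoaderDemo {
    static class Writer {}   // stand-in for IndexWriter
    static class Config {}   // stand-in for SolrIndexConfig

    /** Stand-in for ZoieMergePolicy: one-arg constructor, no default one. */
    static class CustomPolicy {
        final Writer writer;
        CustomPolicy(Writer w) { this.writer = w; }
    }

    /** Mirrors the loading logic quoted above: try the IndexWriter-arg
     *  constructor first, fall back to the no-arg constructor on failure. */
    static Object newPolicy(Class<?> clazz, Object ctorArg) throws Exception {
        try {
            Constructor<?> c = clazz.getDeclaredConstructor(Writer.class);
            c.setAccessible(true);
            return c.newInstance(ctorArg); // throws if ctorArg is not a Writer
        } catch (Exception e) {
            // Fallback path: throws NoSuchMethodException for policies
            // like CustomPolicy that have no default constructor.
            return clazz.getDeclaredConstructor().newInstance();
        }
    }
}
```

With a real Writer the first attempt succeeds; with a Config (the bug) both attempts fail and the load blows up, matching the reported startup failure.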
[jira] [Updated] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated LUCENE-3663: - Attachment: PhoneFilter.java This is a proof-of-concept TokenFilter that does the job using Google's libphonenumber (https://code.google.com/p/libphonenumber/). Each token is converted to a phone number in international format, using a default country for guessing the country code if needed. If the token is not a valid phone number, it's filtered out. Add a phone number normalization TokenFilter Key: LUCENE-3663 URL: https://issues.apache.org/jira/browse/LUCENE-3663 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Santiago M. Mola Priority: Minor Attachments: PhoneFilter.java Phone numbers can be found in the wild in an infinite variety of formats (e.g. with spaces, parentheses, dashes, with or without country code, with letters in substitution of numbers). So some Lucene applications can benefit from phone normalization with a TokenFilter that gets a phone number in any format, and outputs it in a standard format, using a default country to guess the country code if it's not present.
[jira] [Commented] (LUCENE-3660) If indexwriter hits a non-ioexception from indexExists it leaks a write.lock
[ https://issues.apache.org/jira/browse/LUCENE-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173997#comment-13173997 ] Michael McCandless commented on LUCENE-3660: +1, good catch! If indexwriter hits a non-ioexception from indexExists it leaks a write.lock Key: LUCENE-3660 URL: https://issues.apache.org/jira/browse/LUCENE-3660 Project: Lucene - Java Issue Type: Bug Reporter: Robert Muir Attachments: LUCENE-3660.patch the rest of IW's ctor is careful about this. IndexReader.indexExists catches any IOException and returns false, but the problem occurs if some other exception is thrown (in my test, UnsupportedOperationException, but you can imagine others are possible) when trying to e.g. read in the segments file. I think we just need to move the IR.exists stuff inside the try / finally
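Mike's +1 is to the try/finally fix Robert describes. A minimal, self-contained sketch of that pattern follows; a plain ReentrantLock stands in for the directory's write.lock, and indexExists() throws an unchecked exception as in Robert's test. All names here are illustrative, not Lucene's actual code.

```java
import java.util.concurrent.locks.ReentrantLock;

final class WriteLockSafety {

    static boolean indexExists() {
        // Simulates e.g. a directory impl that can't read the segments file:
        // not an IOException, so a plain catch (IOException) won't see it.
        throw new UnsupportedOperationException();
    }

    /** Success-flag try/finally: the lock is released on *any* Throwable,
     *  not just IOException, but kept when construction succeeds. */
    static boolean openSafely(ReentrantLock writeLock) {
        writeLock.lock();
        boolean success = false;
        try {
            boolean exists = indexExists(); // moved inside the guarded region
            success = true;
            return exists;
        } finally {
            if (!success) {
                writeLock.unlock(); // never leak write.lock on failure
            }
        }
    }
}
```

The point of the success flag is that the happy path keeps the lock (as IndexWriter must), while every failure path, checked or unchecked, releases it.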
[jira] [Commented] (LUCENE-3605) revisit segments.gen sleeping
[ https://issues.apache.org/jira/browse/LUCENE-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173995#comment-13173995 ] Michael McCandless commented on LUCENE-3605: Woops -- I'll nuke the getter! revisit segments.gen sleeping - Key: LUCENE-3605 URL: https://issues.apache.org/jira/browse/LUCENE-3605 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Assignee: Michael McCandless Attachments: LUCENE-3605.patch in LUCENE-3601, I worked up a change where we intentionally crash() all un-fsynced files in tests to ensure that we are calling sync on files when we should. I think this would be nice to do always (and with some fixes all tests pass). But this is super-slow sometimes because when we corrupt the unsynced segments.gen, it causes SIS.read to take 500ms each time (and in checkindex for some reason we do this twice, which seems wrong). I can work around this for now for tests (just do a partial crash that avoids corrupting the segments.gen), but I wanted to create this issue for discussion about the sleeping/non-fsyncing of segments.gen, just because I guess it's possible someone could hit this slowness.
[jira] [Commented] (LUCENE-3661) move deletes under codec
[ https://issues.apache.org/jira/browse/LUCENE-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174000#comment-13174000 ] Michael McCandless commented on LUCENE-3661: This sounds like a great plan! So then the use of BitVector is an impl detail to the codec... move deletes under codec Key: LUCENE-3661 URL: https://issues.apache.org/jira/browse/LUCENE-3661 Project: Lucene - Java Issue Type: Task Affects Versions: 4.0 Reporter: Robert Muir After LUCENE-3631, this should be easier I think. I haven't looked at it much myself but i'll play around a bit, but at a glance:
* SegmentReader to have Bits liveDocs instead of BitVector
* address the TODO in the IW-using ctors so that SegmentReader doesn't take a parent but just an existing core.
* we need some type of minimal MutableBits or similar subinterface of bits. BitVector and maybe Fixed/OpenBitSet could implement it
* BitVector becomes an impl detail and moves to codec (maybe we have a shared base class and split the 3.x/4.x up rather than the conditional backwards)
* I think the invertAll should not be used by IndexWriter, instead we define the codec interface to say give me a new MutableBits, by default all are set ?
* redundant internally-consistent checks in checkLiveCounts should be done in the codec impl instead of in SegmentReader.
* plain text impl in SimpleText.
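The MutableBits idea from the list above can be sketched in a few lines. This is illustrative only: the interface names follow the issue's proposal, and a java.util.BitSet stands in for BitVector/Fixed/OpenBitSet:

```java
import java.util.BitSet;

final class MutableBitsDemo {

    /** Read-only view, as SegmentReader would expose for liveDocs. */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** Minimal mutable subinterface: just enough for recording deletes. */
    interface MutableBits extends Bits {
        void clear(int index);
    }

    /** BitSet-backed impl standing in for BitVector; starts all-set, per the
     *  proposed codec contract "give me a new MutableBits, by default all are set". */
    static class LiveDocs implements MutableBits {
        private final BitSet bits;
        private final int len;

        LiveDocs(int len) {
            this.len = len;
            this.bits = new BitSet(len);
            bits.set(0, len); // all documents live initially
        }

        @Override public boolean get(int index) { return bits.get(index); }
        @Override public int length() { return len; }
        @Override public void clear(int index) { bits.clear(index); } // delete a doc
    }
}
```

Handing out an all-set instance from the codec (rather than having IndexWriter call invertAll) is exactly the inversion the fifth bullet proposes.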
[jira] [Updated] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-3631: -- Attachment: LUCENE-3631-threadlocals.patch This patch also moves the threadlocals to SegmentCoreReaders, as they can be reused on reopen/nrt readers. Also improves ensureOpen() checks to guard everything without duplicating checks. Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/... - Key: LUCENE-3631 URL: https://issues.apache.org/jira/browse/LUCENE-3631 Project: Lucene - Java Issue Type: Task Components: core/index Affects Versions: 4.0 Reporter: Uwe Schindler Assignee: Michael McCandless Attachments: LUCENE-3631-threadlocals.patch, LUCENE-3631.patch, LUCENE-3631.patch After LUCENE-3606 is finished, there are some TODOs: SegmentReader still contains (package-private) all delete logic including crazy copyOnWrite for validDocs Bits. It would be good, if SegmentReader itself could be read-only like all other IndexReaders. There are two possibilities to do this:
# the simple one: Subclass SegmentReader and make a RWSegmentReader that is only used by IndexWriter/BufferedDeletes/... DirectoryReader will only use the read-only SegmentReader. This would move all TODOs to a separate class. Its reopen/clone method would always create a RO-SegmentReader (for NRT).
# Remove all write and commit stuff from SegmentReader completely and move it to IndexWriter's readerPool (it must be in readerPool as deletions need a not-changing view on an index snapshot).
Unfortunately the code is so complicated and I have no real experience in those internals of IndexWriter so I did not want to do it with LUCENE-3606, I just separated the code in SegmentReader and marked with TODO. Maybe Mike McCandless can help :-)
[jira] [Commented] (LUCENE-3631) Remove write access from SegmentReader and possibly move to separate class or IndexWriter/BufferedDeletes/...
[ https://issues.apache.org/jira/browse/LUCENE-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174012#comment-13174012 ] Uwe Schindler commented on LUCENE-3631: --- Heavy committed at revision: 1221677
Re: [JENKINS] Lucene-Solr-tests-only-trunk - Build # 11853 - Failure
I can't reproduce this one... Mike McCandless http://blog.mikemccandless.com On Wed, Dec 21, 2011 at 3:45 AM, Apache Jenkins Server jenk...@builds.apache.org wrote: Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/11853/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) at org.apache.lucene.index.TestIndexWriter.testThreadInterruptDeadlock(TestIndexWriter.java:1270) at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528) Build Log (for compile errors): [...truncated 1335 lines...]
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174022#comment-13174022 ] Uwe Schindler commented on LUCENE-3663: --- This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence, but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change!
[jira] [Issue Comment Edited] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174022#comment-13174022 ] Uwe Schindler edited comment on LUCENE-3663 at 12/21/11 11:34 AM: -- This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external library. You should make all fields final, they will never change! was (Author: thetaphi): This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change!
[jira] [Issue Comment Edited] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174022#comment-13174022 ] Uwe Schindler edited comment on LUCENE-3663 at 12/21/11 11:33 AM: -- This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence (so you could directly pass termAtt without toString()), but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change! was (Author: thetaphi): This looks strange and creates useless objects:
{code:java}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
CharBuffer cb = CharBuffer.wrap(buffer, 0, length);
try {
  PhoneNumber pn = pnu.parse(cb.toString(), defaultCountry);
{code}
should be:
{code:java}
try {
  PhoneNumber pn = pnu.parse(termAtt.toString(), defaultCountry);
{code}
Ideally, PhoneNumberUtil would take CharSequence, but unfortunately Google's lib is too stupid to use a more generic Java type. Otherwise patch looks fine, but it adds another external reference. You should make all fields final, they will never change!
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174024#comment-13174024 ] Uwe Schindler commented on LUCENE-3663: --- One more thing, as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the match() method. You are free to modify the token there, too. This new base class would correctly handle position increments, as noted as TODO in your comments.
[jira] [Issue Comment Edited] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174024#comment-13174024 ] Uwe Schindler edited comment on LUCENE-3663 at 12/21/11 11:39 AM: -- One more thing, as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the accept() method. You are free to modify the token there, too. This new base class would correctly handle position increments, as noted as TODO in your comments. was (Author: thetaphi): One more thing, as you want to filter out tokens, you should not subclass TokenFilter directly but instead subclass org.apache.lucene.analysis.util.FilteringTokenFilter and do the work in the match() method. You are free to modify the token there, too. This new base class would correctly handle position increments, as noted as TODO in your comments.
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174025#comment-13174025 ] Michael McCandless commented on LUCENE-3663: +1 I think this would be a useful addition.
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174026#comment-13174026 ] Robert Muir commented on LUCENE-3663: - I think actually that we should not remove tokens that aren't phone numbers. Sometimes there just might be other things instead of phone numbers, or maybe the phone number detection/normalization is just imperfect, so it's better not to throw tokens away; instead just no normalization happens, like a stemmer. In general we can also assume the text is unstructured and might have other stuff (this implies someone has a super-cool tokenizer that doesn't split up any dirty phone numbers, but we just leave the possibility). Then I think the while loop could be removed: if the phone number normalization succeeds, mark the type as phone. Otherwise, in the exception case, output it unchanged. Then non-phone-numbers or whatever can be easily filtered out separately with a subclass of FilteringTokenFilter.
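Robert's "normalize on success, pass through otherwise" control flow can be sketched with a toy normalizer. Note this does not use libphonenumber: normalize() and its 7-15 digit heuristic are invented for illustration only; real code would call PhoneNumberUtil.parse and format the result in international format.

```java
final class PhoneNormalizeDemo {

    /** Return the token in "+&lt;cc&gt;&lt;digits&gt;" form when it looks like a phone
     *  number, otherwise return it unchanged (pass-through, like a stemmer). */
    static String normalize(String token, String defaultCountryCode) {
        boolean international = token.startsWith("+");
        // Strip common separators: spaces, parentheses, dots, dashes.
        String digits = token.replaceAll("[\\s().\\-]", "");
        if (international) {
            return digits; // already carries a country code
        }
        if (!digits.matches("\\d{7,15}")) {
            return token; // not a phone number: leave the token untouched
        }
        return "+" + defaultCountryCode + digits; // guess from default country
    }
}
```

In the actual TokenFilter, the success case would additionally set the token type to e.g. "phone", so a separate FilteringTokenFilter subclass can later drop non-phone tokens if desired.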
[jira] [Commented] (SOLR-2980) DataImportHandler becomes unresponsive with Microsoft JDBC driver
[ https://issues.apache.org/jira/browse/SOLR-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174042#comment-13174042 ] Steve Wolfe commented on SOLR-2980: --- After careful comparison of the working vs. non-working machines I identified that the non-working machines were using a slightly newer build of the JRE (both were using 1.6.0_20, but two different builds of that same runtime). By explicitly installing the older version all issues went away. During diagnostics I had also found that the issue was not specific to Solr, but rather appeared to be between the affected JRE and the SQL Server JDBC driver. Good build of the JRE: http://pkgs.org/centos-5-rhel-5/centos-rhel-x86_64/java-1.6.0-openjdk-1.6.0.0-1.22.1.9.8.el5_6.x86_64.rpm.html Bad build of the JRE: http://pkgs.org/centos-5-rhel-5/centos-rhel-updates-x86_64/java-1.6.0-openjdk-src-1.6.0.0-1.23.1.9.10.el5_7.x86_64.rpm.html DataImportHandler becomes unresponsive with Microsoft JDBC driver - Key: SOLR-2980 URL: https://issues.apache.org/jira/browse/SOLR-2980 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 3.4, 3.5 Environment: Java JRE 1.6.0_20, JRE 1.6.0_29, CentOS (kernel 2.6.18-274.3.1.e15), Microsoft SQL Server JDBC Driver 3.0 Reporter: Steve Wolfe Labels: dataimport, jdbc, solr, sql, sqlserver A solr core has been configured to use the DataImportHandler to read a set of documents from a Microsoft SQL Server database, via the Microsoft JDBC driver. A known-good configuration for the data import handler is used, and a reload-config followed by full-import command are issued to the DataImportHandler. The handler switches to a status of A command is still running..., and shows 1 request has been made to the data source. Subsequent status calls show the Time Elapsed growing, but the handler fails to perform any action--SQL Server confirms that a login event occurs, but no queries are issued. 
Solr does not throw any exceptions, even after a very long duration. The last message in Solr's output is INFO: Creating a connection for entity {entity name} with URL: {entity datasource url} Attempts to issue an Abort command to the DataImportHandler appear successful, but do not stop the operation. Running the solr instance with the java -verbose flag shows the following: *IMMEDIATELY UPON EXECUTING FULL-IMPORT COMMAND*
[Loaded com.microsoft.sqlserver.jdbc.StreamPacket from file:/home/MYWEBGROCER/swolfe/downloads/apache-solr-3.5.0/example/lib/sqljdbc4.jar]
[Loaded com.microsoft.sqlserver.jdbc.StreamLoginAck from file:/home/MYWEBGROCER/swolfe/downloads/apache-solr-3.5.0/example/lib/sqljdbc4.jar]
[Loaded com.microsoft.sqlserver.jdbc.StreamDone from file:/home/MYWEBGROCER/swolfe/downloads/apache-solr-3.5.0/example/lib/sqljdbc4.jar]
*APPROXIMATELY 40 SECONDS LATER*
[Loaded java.io.InterruptedIOException from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/rt.jar]
[Loaded java.net.SocketTimeoutException from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/rt.jar]
[Loaded sun.net.ConnectionResetException from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/rt.jar]
An issue with identical symptoms has been reported on StackOverflow (the OP found that using a 3rd party JDBC driver appeared successful): http://stackoverflow.com/questions/8269038/solr-dataimporthandler-logs-into-sql-but-never-fetches-any-data
[jira] [Closed] (SOLR-2980) DataImportHandler becomes unresponsive with Microsoft JDBC driver
[ https://issues.apache.org/jira/browse/SOLR-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Wolfe closed SOLR-2980. - Resolution: Not A Problem Determined that the issue is not Solr-specific, but rather it occurs between affected versions/builds of the JRE and the MS SQL JDBC driver. See comment for details.
[jira] [Commented] (SOLR-2980) DataImportHandler becomes unresponsive with Microsoft JDBC driver
[ https://issues.apache.org/jira/browse/SOLR-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174044#comment-13174044 ] Uwe Schindler commented on SOLR-2980: - See my comments about the mess with OpenJDK version numbers; you cannot read anything out of them. My advice: don't use OpenJDK, download the real Oracle JDKs, please! http://blog.thetaphi.de/2011/12/jdk-7u2-released-how-about-linux-and.html
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174046#comment-13174046 ] Santiago M. Mola commented on LUCENE-3663: -- @Uwe: Thanks for the comments. @Robert: So this filter would mark phone tokens with the PHONE type, and I could filter non-PHONE tokens with a subsequent filter? In my specific use case, I need to throw away any token that could not be normalized, so I have to, at least, mark phone tokens for removal in further steps. If tokens are not marked, we would have to check twice whether a token is a valid phone number.
Add a phone number normalization TokenFilter Key: LUCENE-3663 URL: https://issues.apache.org/jira/browse/LUCENE-3663 Project: Lucene - Java Issue Type: New Feature Components: modules/analysis Reporter: Santiago M. Mola Priority: Minor Attachments: PhoneFilter.java
Phone numbers can be found in the wild in an infinite variety of formats (e.g. with spaces, parentheses, dashes, with or without country code, with letters substituted for digits). Some Lucene applications could therefore benefit from phone normalization: a TokenFilter that takes a phone number in any format and outputs it in a standard format, using a default country to guess the country code when it is not present.
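The attached PhoneFilter builds on libphonenumber; as a rough stand-alone illustration of the normalization idea only (this is not the attachment's code, and the class name is made up for the example), a plain-Java sketch that drops punctuation and maps ITU keypad letters to digits:

```java
import java.util.Locale;

public class PhoneNormalizer {
    // Standard phone keypad mapping: ABC->2, DEF->3, ..., WXYZ->9 (indexed by letter A..Z).
    private static final String KEYPAD = "22233344455566677778889999";

    /** Strips punctuation and maps keypad letters to digits; returns null if nothing usable remains. */
    public static String normalize(String raw) {
        StringBuilder digits = new StringBuilder();
        for (char c : raw.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(KEYPAD.charAt(c - 'A'));
            } else if (c == '+' && digits.length() == 0) {
                digits.append(c); // keep a leading + marking an explicit country code
            }
            // spaces, dashes, dots, and parentheses are simply dropped
        }
        return digits.length() == 0 ? null : digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("(800) FLOWERS"));   // prints 8003569377
        System.out.println(normalize("+1 212-555-0123")); // prints +12125550123
    }
}
```

A real filter would instead hand the cleaned string to libphonenumber for country-code resolution and E.164 formatting, which this sketch does not attempt.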
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174058#comment-13174058 ] Santiago M. Mola commented on LUCENE-3663: -- Bug report for libphonenumber in order to get it to support CharSequence: https://code.google.com/p/libphonenumber/issues/detail?id=84
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174074#comment-13174074 ] Robert Muir commented on LUCENE-3663: - Santiago, yeah, I think if normalization is successful you would change the type to PHONE, since the token was recognized as one. Otherwise, when you get the exception, just 'return true' and leave all attributes unchanged. In the successful case, besides setting the type, you could even keep the PhoneNumber object rather than throwing it away, and put it in an attribute. That way, if someone wants to do more complicated stuff, the attributes are at least available; it's also useful for things like Solr's analysis.jsp, just for debugging how the analysis worked.
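Robert's suggested behavior (rewrite and re-type on success, pass through untouched on failure) can be sketched outside Lucene's TokenStream API with a simplified token model; the Token class, tag() method, and the toy normalize() below are illustrative stand-ins, not Lucene's or libphonenumber's API:

```java
public class PhoneTypeTagger {
    public static final String PHONE = "PHONE";

    /** Simplified stand-in for a Lucene token: mutable text plus a type attribute. */
    public static class Token {
        public String text;
        public String type = "word";
        public Token(String text) { this.text = text; }
    }

    /**
     * Mimics the suggested incrementToken() logic: try to normalize; on success,
     * rewrite the term and set the type to PHONE; on failure, return the token
     * with all attributes unchanged (the analogue of 'return true').
     */
    public static Token tag(Token token) {
        try {
            token.text = normalize(token.text); // stand-in for libphonenumber parsing
            token.type = PHONE;
        } catch (IllegalArgumentException notAPhone) {
            // leave the token untouched; a later FilteringTokenFilter can drop non-PHONE types
        }
        return token;
    }

    private static String normalize(String raw) {
        String digits = raw.replaceAll("[^0-9]", "");
        if (digits.length() < 7) throw new IllegalArgumentException("not a phone number: " + raw);
        return digits;
    }
}
```

Failed tokens keep their original type, so downstream filters (or analysis.jsp) can still see exactly what the tagger decided.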
[jira] [Commented] (LUCENE-3663) Add a phone number normalization TokenFilter
[ https://issues.apache.org/jira/browse/LUCENE-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174078#comment-13174078 ] Uwe Schindler commented on LUCENE-3663: --- bq. Then this filter would mark phone tokens as PHONE type and I could filter non-PHONE tokens with a subsequent filter? YES! The FilteringTokenFilter subclass you would then add after this filter would simply have this accept() method:
{code:java}
@Override
protected boolean accept() {
  return PHONE.equals(typeAtt.type());
}
{code}
FilteringTokenFilter would then also handle position increments correctly, which your filter does not.
[jira] [Assigned] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson reassigned SOLR-2242: Assignee: Erick Erickson (was: Simon Willnauer)
Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch
When returning facet.field={name of field} you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups).
[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174152#comment-13174152 ] Erick Erickson commented on SOLR-2242: -- OK, it seems like we have several themes here. I'd like to get a reasonable consensus before going forward... I'll put out a straw-man proposal here and we can go from there. But let's figure out where we're going before revamping stuff yet again.
1. Distributed support. I sure don't see a good way to support this currently. Perhaps some of the future enhancements will make this easier (thinking distributed TF/IDF and such, while being totally ignorant of that code), but returning the entire list of constraints (or names or terms or whatever we call it) is just a bad idea. The first time someone tries this on a field with 1,000,000 terms (yes, I've seen this) it'll just blow things up. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone there are between 10,000 and 90,000 distinct values. And if it's a field with just a few pre-defined values, that information is already known anyway. But if someone can show a use-case here I'm not completely against it. I'd like to see the use case first, though, not just "someone might find it useful".
2. Back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places; are there reasons NOT to do it that way?
3. Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here.
Again, I want a clearly defined end point for the concerns raised before we dive back in here.
[jira] [Issue Comment Edited] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174152#comment-13174152 ] Erick Erickson edited comment on SOLR-2242 at 12/21/11 3:45 PM: OK, it seems like we have several themes here. I'd like to get a reasonable consensus before going forward... I'll put out a straw-man proposal here and we can go from there. But let's figure out where we're going before revamping stuff yet again.
1. Distributed support. I sure don't see a good way to support this currently. Perhaps some of the future enhancements will make this easier (thinking distributed TF/IDF and such, while being totally ignorant of that code), but returning the entire list of constraints (or names or terms or whatever we call it) is just a bad idea. The first time someone tries this on a field with 1,000,000 terms (yes, I've seen this) it'll just blow things up. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone there are between 10,000 and 90,000 distinct values. And if it's a field with just a few pre-defined values, that information is already known anyway. But if someone can show a use-case here I'm not completely against it. I'd like to see the use case first, though, not just "someone might find it useful".
2. Back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places; are there reasons NOT to do it that way? Or does this mess up JSON, PHP, etc.?
3. Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here.
Again, I want a clearly defined end point for the concerns raised before we dive back in here.
[jira] [Commented] (SOLR-2804) Logging error causes entire DIH process to fail
[ https://issues.apache.org/jira/browse/SOLR-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174156#comment-13174156 ] Michael Haeusler commented on SOLR-2804: This problem also occurs with Solr 3.5.0. The stacktrace is almost identical:
{code}
Dec 20, 2011 11:22:36 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
	at org.apache.solr.common.util.NamedList.toString(NamedList.java:253)
	at java.lang.String.valueOf(String.java:2826)
	at java.lang.StringBuilder.append(StringBuilder.java:115)
	at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
	at org.apache.solr.handler.dataimport.SolrWriter.finish(SolrWriter.java:133)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:213)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
{code}
Logging error causes entire DIH process to fail --- Key: SOLR-2804 URL: https://issues.apache.org/jira/browse/SOLR-2804 Project: Solr Issue Type: Bug Components: contrib - DataImportHandler Affects Versions: 4.0 Environment: java version 1.6.0_26, Java(TM) SE Runtime Environment (build 1.6.0_26-b03-384-10M3425), Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-384, mixed mode); MacBook Pro (MacBookPro8,2), Intel Core i7, 2.2 GHz, 1 processor, 4 cores, L2 Cache (per core): 256 KB, L3 Cache: 6 MB, Memory: 4 GB; Mac OS X 10.6.8 (10K549), Kernel Version: Darwin 10.8.0 Reporter: Pulkit Singhal Labels: dih Original Estimate: 48h Remaining Estimate: 48h
{code}
SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
	at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
	at org.apache.solr.common.util.NamedList.toString(NamedList.java:263)
	at java.lang.String.valueOf(String.java:2826)
	at java.lang.StringBuilder.append(StringBuilder.java:115)
	at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
	at org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421)
{code}
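The ClassCastException comes from NamedList's storage scheme, which keeps names and values interleaved in a single flat list and casts the name slot to String in getName(). A simplified stand-alone model of that scheme (FlatNamedList here is illustrative, not Solr's actual class) shows how a value that lands in a name slot blows up during toString()-style iteration:

```java
import java.util.ArrayList;
import java.util.List;

public class FlatNamedList {
    // Names and values interleaved in one flat list: [name0, value0, name1, value1, ...]
    private final List<Object> nvPairs = new ArrayList<>();

    public void add(String name, Object value) {
        nvPairs.add(name);
        nvPairs.add(value);
    }

    /** Casts the name slot to String, like Solr's NamedList.getName(int). */
    public String getName(int idx) {
        return (String) nvPairs.get(idx << 1);
    }

    /** Simulates a buggy caller appending a bare value, which shifts later name slots. */
    public void addRaw(Object o) {
        nvPairs.add(o);
    }
}
```

Once the pairing is misaligned, getName() for a later entry retrieves a non-String (here an ArrayList) and the cast fails with exactly the exception seen in the logs.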
[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174179#comment-13174179 ] Jonathan Rochkind commented on SOLR-2242: - I would find this feature valuable even if it simply did not work at all on a distributed index. (Refusing to return a value, rather than returning a known-incorrect value, would seem like the right way to go.) Because my index is not distributed, and I would find this feature valuable, heh. I don't know if Solr currently has any policies against committing features that can't work distributed, but personally my 'vote' would be doing that here, with clear documentation that it doesn't work on distributed indexes (and the hope that future enhancements may make it more feasible to do so, as Erick suggests may possibly happen).
[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174184#comment-13174184 ] Yonik Seeley commented on SOLR-2242: bq. I'm also slightly anti the min/max idea. I'm not sure what value there is in telling someone there are between 10,000 and 90,000 distinct values. I think we could come up with a pretty good estimate (but we should tell them it's an estimate somehow). Anyway, that could optionally be handled in a different issue. bq. 2 back compat. Cody's suggestion seems to be the slickest in terms of not breaking things, but we use attributes in just a few places, are there reasons NOT to do it that way? Or does this mess up JSON, PHP, etc? Yes, it messes up JSON, binary format, etc. We'd need to figure out how to add attributes into our data model (that gets sent to response writers) in a generic way. bq. 3 Possibly add a new JIRA for changing the facet response format to be tolerant of sub-fields, but don't do that here. Not sure how that's possible... it's either more magic field names in with the individual constraints, or the facet response format has got to change. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. 
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
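For readers skimming the response above: numFacetTerms is simply the number of distinct value buckets returned alongside the usual per-value counts. A minimal plain-Java sketch of that relationship (illustrative only, not the patch's code; all names here are hypothetical):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetCountSketch {
    // Tally per-value counts; the distinct-value count ("numFacetTerms") is the map size.
    public static Map<String, Integer> facetCounts(List<String> fieldValues) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : fieldValues) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = facetCounts(
            Arrays.asList("0.0", "0.0", "0.0", "11.5", "19.95"));
        System.out.println("numFacetTerms=" + counts.size()); // numFacetTerms=3
        System.out.println(counts);                           // {0.0=3, 11.5=1, 19.95=1}
    }
}
```

This mirrors the example output: 14 per-value entries, hence numFacetTerms=14 there.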
[jira] [Updated] (SOLR-2950) QueryElevationComponent needlessly looks up document ids
[ https://issues.apache.org/jira/browse/SOLR-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated SOLR-2950: --- Attachment: SOLR-2950.patch

OK, just had a chance to view the comparator part of this patch. Here's a patch that fixes:
- minor check-for-null for fields() and terms(), which can return null
- even though docsEnum returns something, it may be deleted (i.e. need to check for NO_MORE_DOCS)
- use liveDocs when requesting the docsEnum so we won't use a deleted (overwritten) doc

The last two issues would both cause us to miss elevated documents if they have been updated and an old deleted version still exists in the index.

QueryElevationComponent needlessly looks up document ids Key: SOLR-2950 URL: https://issues.apache.org/jira/browse/SOLR-2950 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 4.0 Attachments: SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch The QueryElevationComponent needlessly instantiates a FieldCache and does look ups in it for every document. If we flipped things around a bit and got Lucene internal doc ids on inform() we could then simply do a much smaller and faster lookup during the sort. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
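The pitfalls listed in the comment above (the NO_MORE_DOCS sentinel and liveDocs) follow a common pattern when scanning postings. The sketch below is a plain-Java stand-in, not the patch's actual code: a BitSet plays the role of liveDocs and a sentinel constant stands in for Lucene's DocIdSetIterator.NO_MORE_DOCS, just to illustrate the two checks:

```java
import java.util.BitSet;

public class LiveDocScanSketch {
    // Sentinel meaning "iteration exhausted", like Lucene's DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Return the first non-deleted doc id, or -1 if none.
    // liveDocs == null means the segment has no deletions.
    public static int firstLiveDoc(int[] postings, BitSet liveDocs) {
        for (int doc : postings) {
            if (doc == NO_MORE_DOCS) {
                break; // getting an enum back is not enough; the sentinel must be checked too
            }
            if (liveDocs == null || liveDocs.get(doc)) {
                return doc; // skip deleted (overwritten) docs
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(5); // only doc 5 is live; doc 2 was overwritten and deleted
        System.out.println(firstLiveDoc(new int[] {2, 5, NO_MORE_DOCS}, live)); // 5
        System.out.println(firstLiveDoc(new int[] {NO_MORE_DOCS}, live));       // -1
    }
}
```

In real Lucene code the liveDocs filter is passed when requesting the docsEnum, so deleted docs are skipped at the source rather than checked by the caller.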
[jira] [Commented] (SOLR-2950) QueryElevationComponent needlessly looks up document ids
[ https://issues.apache.org/jira/browse/SOLR-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174258#comment-13174258 ] Grant Ingersoll commented on SOLR-2950: --- +1, go ahead and commit. QueryElevationComponent needlessly looks up document ids Key: SOLR-2950 URL: https://issues.apache.org/jira/browse/SOLR-2950 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 4.0 Attachments: SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch, SOLR-2950.patch The QueryElevationComponent needlessly instantiates a FieldCache and does look ups in it for every document. If we flipped things around a bit and got Lucene internal doc ids on inform() we could then simply do a much smaller and faster lookup during the sort. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-2977) QueryElevationComponent should support fake excludes
[ https://issues.apache.org/jira/browse/SOLR-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-2977: - Assignee: Grant Ingersoll QueryElevationComponent should support fake excludes -- Key: SOLR-2977 URL: https://issues.apache.org/jira/browse/SOLR-2977 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It would be handy to be able to, in the QEC, simply mark documents as excluded instead of completely excluding them. This can be achieved using the EditorialMarker that was recently added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2984) Function query does not work when when the value of function parameter has null.
Function query does not work when when the value of function parameter has null. Key: SOLR-2984 URL: https://issues.apache.org/jira/browse/SOLR-2984 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 3.3 Reporter: Pradeep Priority: Minor To reproduce, sort parameter in query looks like sort=sum(product(Rating,0.01),product(recip(ms(NOW/HOUR,Date),3.16e-11,1,1),0.04)) desc and if Rating column in the database has null values, results are not sorted according to the output value of the function. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
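A workaround that often helps in this situation — an assumption about this setup, not a verified fix for the bug, and only if the def() function is available in your Solr version — is to give the nullable field an explicit default with def(), so every document contributes a well-defined value to the computed sort key:

{code}
sort=sum(product(def(Rating,0),0.01),product(recip(ms(NOW/HOUR,Date),3.16e-11,1,1),0.04)) desc
{code}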
[jira] [Updated] (SOLR-2984) Function query does not work when the value of function parameter has null.
[ https://issues.apache.org/jira/browse/SOLR-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep updated SOLR-2984: -- Summary: Function query does not work when the value of function parameter has null. (was: Function query does not work when when the value of function parameter has null.) Function query does not work when the value of function parameter has null. --- Key: SOLR-2984 URL: https://issues.apache.org/jira/browse/SOLR-2984 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 3.3 Reporter: Pradeep Priority: Minor To reproduce, sort parameter in query looks like sort=sum(product(Rating,0.01),product(recip(ms(NOW/HOUR,Date),3.16e-11,1,1),0.04)) desc and if Rating column in the database has null values, results are not sorted according to the output value of the function. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated SOLR-2242: - Attachment: SOLR-2242.patch First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NmFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct.
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field. 
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174441#comment-13174441 ] Erick Erickson edited comment on SOLR-2242 at 12/21/11 9:51 PM: First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. All I guarantee is that the code compiles and the NumFacetTermsFacetsTest runs from inside IntelliJ. was (Author: erickerickson): First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. 
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174441#comment-13174441 ] Erick Erickson edited comment on SOLR-2242 at 12/21/11 9:50 PM: First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NumFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. was (Author: erickerickson): First step in resurrecting this. This patch should apply cleanly to trunk. It incorporates the SOLR-2242.patch from 28-June and the NmFacetTermsFacetsTest from 9-July. It accounts for the fact that things seem to have been moved around a bit. Get distinct count of names for a facet field - Key: SOLR-2242 URL: https://issues.apache.org/jira/browse/SOLR-2242 Project: Solr Issue Type: New Feature Components: Response Writers Affects Versions: 4.0 Reporter: Bill Bell Assignee: Erick Erickson Priority: Minor Fix For: 4.0 Attachments: NumFacetTermsFacetsTest.java, SOLR-2242-notworkingtest.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.patch, SOLR-2242.shard.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR.2242.v2.patch When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. 
Here is an example:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price
This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int><int name="11.5">1</int><int name="19.95">1</int><int name="74.99">1</int><int name="92.0">1</int><int name="179.99">1</int><int name="185.0">1</int><int name="279.95">1</int><int name="329.95">1</int><int name="350.0">1</int><int name="399.0">1</int><int name="479.95">1</int><int name="649.99">1</int><int name="2199.0">1</int>
  </lst>
</lst>
{code}
Several people use this to get the group.field count (the # of groups). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2906) Implement LFU Cache
[ https://issues.apache.org/jira/browse/SOLR-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174475#comment-13174475 ] Shawn Heisey commented on SOLR-2906: I must be dense. I can figure out how to add the timeDecay option, but I can't figure out what section of code to enable/disable based on the value of timeDecay. I've gone as far as doing a diff on my Nov 24th patch and the Dec 20th patch from Erick. (doing diffs on diffs ... the world is going to explode!) The only differences I can see between the two is in whitespace/formatting. Implement LFU Cache --- Key: SOLR-2906 URL: https://issues.apache.org/jira/browse/SOLR-2906 Project: Solr Issue Type: Sub-task Components: search Affects Versions: 3.4 Reporter: Shawn Heisey Assignee: Erick Erickson Priority: Minor Attachments: ConcurrentLFUCache.java, LFUCache.java, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, TestLFUCache.java Implement an LFU (Least Frequently Used) cache as the first step towards a full ARC cache -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2977) QueryElevationComponent should support fake excludes
[ https://issues.apache.org/jira/browse/SOLR-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-2977: -- Attachment: SOLR-2977.patch first draft. QueryElevationComponent should support fake excludes -- Key: SOLR-2977 URL: https://issues.apache.org/jira/browse/SOLR-2977 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-2977.patch It would be handy to be able to, in the QEC, simply mark documents as excluded instead of completely excluding them. This can be achieved using the EditorialMarker that was recently added. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-tests-only-trunk-java7 - Build # 1311 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk-java7/1311/

1 tests failed.

REGRESSION: org.apache.solr.search.TestRealTimeGet.testStressGetRealtime

Error Message: java.lang.AssertionError: Some threads threw uncaught exceptions!

Stack Trace:
java.lang.RuntimeException: java.lang.AssertionError: Some threads threw uncaught exceptions!
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:657)
	at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:86)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
	at org.apache.lucene.util.LuceneTestCase.checkUncaughtExceptionsAfter(LuceneTestCase.java:685)
	at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:629)

Build Log (for compile errors): [...truncated 11794 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2906) Implement LFU Cache
[ https://issues.apache.org/jira/browse/SOLR-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated SOLR-2906: - Attachment: SOLR-2906.patch Here's what I had in mind; at least I *think* this will do, but all I've done is ensure that the code compiles and the current LFU test suite runs. Look in the diff for timeDecay. This still needs some proof that the new parameter comes through from a schema file. Let me know if that presents a problem or if you can't get 'round to it; I might have some time over Christmas. I think maybe you were under the impression that this had already been done and were looking for it to be in the code already? Implement LFU Cache --- Key: SOLR-2906 URL: https://issues.apache.org/jira/browse/SOLR-2906 Project: Solr Issue Type: Sub-task Components: search Affects Versions: 3.4 Reporter: Shawn Heisey Assignee: Erick Erickson Priority: Minor Attachments: ConcurrentLFUCache.java, LFUCache.java, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, SOLR-2906.patch, TestLFUCache.java Implement an LFU (Least Frequently Used) cache as the first step towards a full ARC cache -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
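Since the open question above is where timeDecay should take effect, here is a minimal, self-contained sketch of the general idea (hypothetical names, not the patch's actual code): when timeDecay is on, every surviving entry's hit count is halved during an eviction sweep, so entries that were hot long ago gradually lose their advantage over recently popular ones:

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class LfuDecaySketch {
    static final class Entry {
        final String key;
        long hits;
        Entry(String key) { this.key = key; }
    }

    private final Map<String, Entry> map = new HashMap<>();
    private final boolean timeDecay;

    LfuDecaySketch(boolean timeDecay) { this.timeDecay = timeDecay; }

    void touch(String key) {
        map.computeIfAbsent(key, Entry::new).hits++;
    }

    long hits(String key) { return map.get(key).hits; }

    // Evict the least-frequently-used entry; with timeDecay, halve all remaining counts.
    String evictOne() {
        Entry victim = Collections.min(map.values(), Comparator.comparingLong(e -> e.hits));
        map.remove(victim.key);
        if (timeDecay) {
            for (Entry e : map.values()) {
                e.hits >>>= 1; // decay: old popularity fades with each sweep
            }
        }
        return victim.key;
    }

    public static void main(String[] args) {
        LfuDecaySketch cache = new LfuDecaySketch(true);
        for (int i = 0; i < 8; i++) cache.touch("hot");
        for (int i = 0; i < 3; i++) cache.touch("warm");
        cache.touch("cold");
        System.out.println(cache.evictOne());   // cold
        System.out.println(cache.hits("hot"));  // 4 (halved from 8)
    }
}
```

The enable/disable point is therefore the eviction sweep: with timeDecay off, the decay loop is simply skipped and raw lifetime counts decide eviction.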
[jira] [Commented] (SOLR-2841) Scriptable UpdateRequestChain
[ https://issues.apache.org/jira/browse/SOLR-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174566#comment-13174566 ] Lance Norskog commented on SOLR-2841: - +1 Another use case for scripting at the top level is multi-query queries: where the app creates the second based on the first. Would your proposal handle this problem? Many use cases for grouping/collapsing can be implemented with 2 queries. Perhaps the guts of collapsing could be simplified if the more outré use cases could be pushed out into multiple queries.

Scriptable UpdateRequestChain - Key: SOLR-2841 URL: https://issues.apache.org/jira/browse/SOLR-2841 Project: Solr Issue Type: New Feature Components: update Reporter: Jan Høydahl UpdateProcessorChains must currently be defined with XML in solrconfig.xml. We should explore a scriptable chain implementation with a DSL that allows for full flexibility. The first step would be to make UpdateChain implementations pluggable in solrconfig.xml, for backward compat support. Benefits and possibilities with a Scriptable UpdateChain:
* A compact DSL for defining Processors and Chains (Workflows would be a better, less limited term here)
* Keeping update processor config separate from solrconfig.xml gives better separations of roles
* Use this as an opportunity to natively support scripting language Processors (ideas from SOLR-1725)
This issue is spun off from SOLR-2823. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: highlight bug
(11/12/21 4:50), Celso Oliveira wrote: Hey guys, I'm having a little problem with solr 3.4.0 when I turn on the highlight. It looks like this: https://issues.apache.org/jira/browse/SOLR-925 (fixed in the 1.4 version). But now, on the 3.4.0 version, I still get this error.

Your problem is not the same as SOLR-925, because you use FVH. Please open a ticket with the following info:
- schema.xml (field type and field of hl.fl)
- request url
- document data
thanks! koji -- http://www.rondhuit.com/en/ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2346) Non UTF-8 Text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly.
[ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174637#comment-13174637 ] Shinichiro Abe commented on SOLR-2346: -- I've faced the same problem. Tika parsed my Shift_JIS file as windows-1252, so I could not see the desired results. I can index the file correctly by applying Koji's patch. But this patch is effective for remote streaming, not for POST, so I changed a part of the code as below:
{noformat}
//String charset = ContentStreamBase.getCharsetFromContentType(stream.getContentType());
String contentType = req.getParams().get(CommonParams.STREAM_CONTENTTYPE, null);
String charset = ContentStreamBase.getCharsetFromContentType(contentType);
{noformat}
Non UTF-8 Text files having other than English texts (Japanese/Hebrew) are not getting indexed correctly. --- Key: SOLR-2346 URL: https://issues.apache.org/jira/browse/SOLR-2346 Project: Solr Issue Type: Bug Components: contrib - Solr Cell (Tika extraction) Affects Versions: 1.4.1, 3.1, 4.0 Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1, Machine was booted in Japanese Locale. Reporter: Prasad Deshpande Assignee: Koji Sekiguchi Priority: Critical Fix For: 3.6, 4.0 Attachments: NormalSave.msg, SOLR-2346.patch, UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt I am able to successfully index/search non-English files (like Hebrew, Japanese) that were encoded in UTF-8. However, when I tried to index data which was encoded in a local encoding like Big5 for Japanese, I could not see the desired results. The contents after indexing looked garbled for the Big5-encoded document when I searched for all indexed documents. 
When I index the attached non-UTF-8 file it indexes in the following way:
{code}
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="attr_content"><str>�� ��</str></arr>
      <arr name="attr_content_encoding"><str>Big5</str></arr>
      <arr name="attr_content_language"><str>zh</str></arr>
      <arr name="attr_language"><str>zh</str></arr>
      <arr name="attr_stream_size"><str>17</str></arr>
      <arr name="content_type"><str>text/plain</str></arr>
      <str name="id">doc2</str>
    </doc>
  </result>
</response>
{code}
Here you said it indexes the file in UTF-8; however, it seems that the non-UTF-8 file gets indexed in Big5 encoding. Here I tried fetching the indexed data stream in Big5 and converting it to UTF-8:
{code}
String id = (String) resultDocument.getFirstValue("attr_content");
byte[] bytearray = id.getBytes("Big5");
String utf8String = new String(bytearray, "UTF-8");
{code}
It does not give the expected results. When I index the UTF-8 file it indexes like the following:
{code}
<doc>
  <arr name="attr_content"><str>マイ ネットワーク</str></arr>
  <arr name="attr_content_encoding"><str>UTF-8</str></arr>
  <arr name="attr_stream_content_type"><str>text/plain</str></arr>
  <arr name="attr_stream_name"><str>sample_jap_unicode.txt</str></arr>
  <arr name="attr_stream_size"><str>28</str></arr>
  <arr name="attr_stream_source_info"><str>myfile</str></arr>
  <arr name="content_type"><str>text/plain</str></arr>
  <str name="id">doc2</str>
</doc>
{code}
So, I can index and search UTF-8 data. For more reference, below is the discussion with Yonik. Please find attached the TXT file which I was using to index and search.
{code}
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F myfile=@sample_jap_non_UTF-8
{code}
One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8. 
Here's one way to actually tell Solr what the encoding of the text you are sending is:
{code}
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
{code}
Now the problem appears that for some reason, this doesn't work... Could you open a JIRA issue and attach your two test files? -Yonik http://lucidimagination.com -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
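The encoding confusion in this thread is easy to reproduce in plain Java: the same bytes decoded with the charset they were written in round-trip fine, while a wrong fallback (what a parser guesses when no charset reaches it) garbles them. This sketch uses Shift_JIS, as in Shinichiro's report; the same holds for Big5:

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        String original = "マイ ネットワーク"; // the sample text from the issue
        byte[] bytes = original.getBytes(Charset.forName("Shift_JIS"));

        // Decoding with the charset the bytes were actually written in works:
        String right = new String(bytes, Charset.forName("Shift_JIS"));
        // Decoding with an unrelated single-byte charset produces mojibake:
        String wrong = new String(bytes, Charset.forName("windows-1252"));

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
    }
}
```

This is why the charset attribute from the Content-Type header must be passed through to the extracting parser rather than letting it fall back to a platform or detector default.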