[Lucene.Net] [jira] [Commented] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053692#comment-13053692 ] Itamar Syn-Hershko commented on LUCENENET-426: -- I think we already went through this? Anyway, I don't see how that's relevant. My point here is . As for my particular use case - I'm trying to highlight based on termvectors of field A using the stored content from field B. I don't think I'm doing anything wrong here. Only spent 10 minutes on that tho. Mark BaseFragmentsBuilder methods as virtual Key: LUCENENET-426 URL: https://issues.apache.org/jira/browse/LUCENENET-426 Project: Lucene.Net Issue Type: Improvement Components: Lucene.Net Contrib Affects Versions: Lucene.Net 2.9.2, Lucene.Net 2.9.4, Lucene.Net 3.x, Lucene.Net 2.9.4g Reporter: Itamar Syn-Hershko Priority: Minor Fix For: Lucene.Net 2.9.4, Lucene.Net 2.9.4g Attachments: fvh.patch Without marking methods in BaseFragmentsBuilder as virtual, it is meaningless to have FragmentsBuilder deriving from a class named Base, since most of its functionality cannot be overridden. Attached is a patch for marking the important methods virtual. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[Lucene.Net] [jira] [Issue Comment Edited] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053692#comment-13053692 ] Itamar Syn-Hershko edited comment on LUCENENET-426 at 6/23/11 8:02 AM: --- I think we already went through this? Anyway, I don't see how that's relevant. My point here is language difference you didn't take into account, and that point is valid. As for my particular use case - I'm trying to highlight based on termvectors of field A using the stored content from field B. I don't think I'm doing anything wrong here. Only spent 10 minutes on that tho. was (Author: itamar): I think we already went through this? Anyway, I don't see how that's relevant. My point here is . As for my particular use case - I'm trying to highlight based on termvectors of field A using the stored content from field B. I don't think I'm doing anything wrong here. Only spent 10 minutes on that tho.
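The language difference the thread hinges on is that Java instance methods are virtual by default, while C# methods must be marked `virtual` before a derived class can override them. A minimal, illustrative Java sketch (class and method names here are hypothetical, not the actual Lucene.Net API):

```java
// Illustrative only: in Java, a subclass overrides a base method with no
// extra keyword, so dynamic dispatch "just works". In C#, the port's
// BaseFragmentsBuilder methods stay non-overridable until marked `virtual`,
// which is what the attached fvh.patch changes.
class BaseBuilder {
    String createFragment() {
        return "base fragment";
    }
}

class CustomBuilder extends BaseBuilder {
    @Override
    String createFragment() {
        return "custom fragment";
    }
}

public class VirtualDispatchDemo {
    public static void main(String[] args) {
        BaseBuilder b = new CustomBuilder();
        // Dynamic dispatch selects the subclass implementation.
        System.out.println(b.createFragment()); // prints "custom fragment"
    }
}
```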
[Lucene.Net] [jira] [Commented] (LUCENENET-426) Mark BaseFragmentsBuilder methods as virtual
[ https://issues.apache.org/jira/browse/LUCENENET-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053722#comment-13053722 ] Digy commented on LUCENENET-426: 10 min. work done. DIGY
[JENKINS] Solr-3.x - Build # 388 - Failure
Build: https://builds.apache.org/job/Solr-3.x/388/ No tests ran. Build Log (for compile errors): [...truncated 15801 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053675#comment-13053675 ] Simon Willnauer commented on LUCENE-3225: - Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone, plus it's cluttering the seek method, though one boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which only supports exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy? Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). -- This message is automatically generated by JIRA.
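The trade-off being discussed can be sketched outside Lucene. This self-contained analogue (illustrative only, not the real TermsEnum API; the `onlyExact` flag mirrors the boolean param the patch adds to seek(BytesRef)) shows how an exact-only seek can skip the extra work of materializing the ceiling term on a miss:

```java
import java.util.Arrays;

// Sketch: an exact-only seek returns as soon as it knows the term is absent,
// while a full seek must also position on the ceiling (next-largest) term.
public class ExactSeekDemo {
    enum SeekStatus { FOUND, NOT_FOUND, END }

    static String ceilingTerm; // set only when a non-exact seek misses

    static SeekStatus seek(String[] sortedTerms, String target, boolean onlyExact) {
        int idx = Arrays.binarySearch(sortedTerms, target);
        if (idx >= 0) {
            return SeekStatus.FOUND;
        }
        if (onlyExact) {
            // Caller doesn't need the following term: skip computing it.
            return SeekStatus.NOT_FOUND;
        }
        // binarySearch returns (-(insertion point) - 1) on a miss.
        int insertion = -idx - 1;
        if (insertion == sortedTerms.length) {
            return SeekStatus.END;
        }
        ceilingTerm = sortedTerms[insertion];
        return SeekStatus.NOT_FOUND;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "lucene", "solr"};
        System.out.println(seek(terms, "lucene", true));  // FOUND
        System.out.println(seek(terms, "mango", true));   // NOT_FOUND, no ceiling work
        System.out.println(seek(terms, "mango", false));  // NOT_FOUND
        System.out.println(ceilingTerm);                  // solr
    }
}
```

This is the pattern IndexWriter's delete-by-Term and TermQuery benefit from: on a miss they never look at the ceiling term.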
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053678#comment-13053678 ] Simon Willnauer commented on SOLR-2193: --- bq. Curious; why is the resolution status invalid? Well, we decided to cut this into two new issues and close this one. See: bq. further developments in SOLR-2565 and SOLR-2566 There have been discussions about the focus here, so we made it more clear. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index.
[jira] [Commented] (LUCENE-3229) Overlaped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053780#comment-13053780 ] Paul Elschot commented on LUCENE-3229: -- To reduce surprises like this one when nested spans are used, the ordered case might be changed to require no overlap at all. To do that one could compare the end of one span with the beginning of the next one. AFAIK none of the existing test cases uses a nested span query, so some more test cases for that would be good to have. The docSpansOrdered method in NearSpansUnordered from the SpanOverLap2.diff patch is the same as the existing docSpansOrdered method in NearSpansOrdered. That is probably not intended. Could you provide patches as described here: http://wiki.apache.org/lucene-java/HowToContribute ? Overlaped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using Span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test) : w1 w2 w3 w4 w5 If I try to search for this span query : spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned and I think it should not, because 'w4' is not after 'w5'. The 2 spans are not ordered, because there is an overlap. I will add a test patch in the TestNearSpansOrdered unit test. I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap. -- This message is automatically generated by JIRA.
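Paul's suggested stricter rule, comparing the end of one span with the beginning of the next, can be sketched in isolation (illustrative only; this is not the actual NearSpansOrdered code, whose existing check compares span starts and can therefore admit overlapping spans):

```java
// Sketch of the no-overlap ordering check: two spans count as ordered only
// when the first ends at or before the start of the second.
public class SpanOrderDemo {
    // Positions are token offsets; end is exclusive, as in Lucene spans.
    static boolean orderedNoOverlap(int start1, int end1, int start2, int end2) {
        return end1 <= start2;
    }

    public static void main(String[] args) {
        // Document "w1 w2 w3 w4 w5": the inner span w3..w5 covers positions
        // 2..5 and the term w4 covers 3..4. They overlap, so not ordered:
        System.out.println(orderedNoOverlap(2, 5, 3, 4)); // false
        // w1 (0..1) followed by w3 (2..3) is genuinely ordered:
        System.out.println(orderedNoOverlap(0, 1, 2, 3)); // true
    }
}
```

Under this rule the query spanNear([spanNear([w3, w5], 1, true), w4], 0, true) would no longer match "w1 w2 w3 w4 w5".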
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053784#comment-13053784 ] Michael McCandless commented on LUCENE-3226: Can we have the constant name be descriptive (reflect what actually changed) and then add a comment expressing the version when that change was made to Lucene? I think we should name them primarily for the benefit of developers working with the source code going forward, and only secondarily for the future developer who will some day need to remove them... rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex Key: LUCENE-3226 URL: https://issues.apache.org/jira/browse/LUCENE-3226 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1, 3.2 Reporter: Hoss Man Fix For: 3.3, 4.0 Attachments: LUCENE-3226.patch A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as... {noformat} Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1] {noformat} It seems like there are two very confusing pieces of information here... 1) the variable name of SegmentInfos.FORMAT_3_1 seems like a poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced. 2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it Lucene 3.1, which is misleading since that format is always used in 3.2 (and probably 3.3, etc...). I suggest: a) rename FORMAT_3_1 to something like FORMAT_SEGMENT_RECORDS_VERSION b) change CheckIndex so that the label for the newest format always ends with "and later" (ie: Lucene 3.1 and later) so when we release versions w/o a format change we don't have to remember to manually list them in CheckIndex.
when we *do* make format changes and update CheckIndex, "and later" can be replaced with "to X.Y" and the new format can be added -- This message is automatically generated by JIRA.
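Hoss's suggestion (b) amounts to a small labeling rule. A hedged sketch with hypothetical constants (illustrative only, not the real CheckIndex code):

```java
// Sketch: the newest known format is labeled "... and later" so releases
// that don't change the format need no CheckIndex edit. The constant value
// here is made up for illustration.
public class FormatLabelDemo {
    static final int FORMAT_SEGMENT_RECORDS_VERSION = -10; // hypothetical newest
    static final int NEWEST_FORMAT = FORMAT_SEGMENT_RECORDS_VERSION;

    static String label(int format) {
        if (format == NEWEST_FORMAT) {
            // Open-ended label survives releases with no format change.
            return "Lucene 3.1 and later";
        }
        return "older format " + format;
    }

    public static void main(String[] args) {
        System.out.println(label(FORMAT_SEGMENT_RECORDS_VERSION)); // Lucene 3.1 and later
    }
}
```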
[jira] [Updated] (SOLR-2305) DataImportScheduler - Marko Bonaci
[ https://issues.apache.org/jira/browse/SOLR-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marko Bonaci updated SOLR-2305: --- Attachment: patch.txt Patch for adding DIHScheduler v1.2 to Solr DataImportScheduler - Marko Bonaci --- Key: SOLR-2305 URL: https://issues.apache.org/jira/browse/SOLR-2305 Project: Solr Issue Type: New Feature Affects Versions: 4.0 Reporter: Bill Bell Fix For: 4.0 Attachments: patch.txt Marko Bonaci has updated the WIKI page to add the DataImportScheduler, but I cannot find a JIRA ticket for it? http://wiki.apache.org/solr/DataImportHandler Do we have a ticket so the code can be tracked?
[jira] [Issue Comment Edited] (SOLR-2305) DataImportScheduler - Marko Bonaci
[ https://issues.apache.org/jira/browse/SOLR-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053796#comment-13053796 ] Marko Bonaci edited comment on SOLR-2305 at 6/23/11 11:17 AM: -- This is patch for adding DIHScheduler v1.2 to Solr. I didn't know I could make a patch concerning only org.apache.solr.handler.dataimport package :( So finally, here it is. Since I still have problems with build path/packages in Eclipse: Wasn't tested at all. No unit tests. Whoever will be adding this please feel free to contact me if such a need arises. Also, all criticism is more than welcome, I want to learn to do this the right way. Thanks was (Author: mbonaci): Patch for adding DIHScheduler v1.2 to Solr
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053800#comment-13053800 ] Noble Paul commented on SOLR-2382: -- The patch applies well. Suggestions: The SolrWriter/DIHPropertiesWriter abstraction can be a separate patch and I can commit it right away. It may also have the changes for passing the handler name. The DIHCache should take the Context as a param and the EntityProcessor does not need to make a copy of the attributes. DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch Functionality: 1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application. 2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor). 3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches. 4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel. Use Cases: 1. We needed a flexible scalable way to temporarily cache child-entity data prior to joining to parent entities. - Using SqlEntityProcessor with Child Entities can cause an n+1 select problem. - CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism and does not scale. - There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process. 3. We wanted the ability to do a delta import of only the entities that changed. - Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed. - Our data comes from 50+ complex sql queries and/or flat files. - We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed. - Persistent DIH caches solve this problem. 4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter). 5. In the future, we may need to use Shards, creating a need to easily partition our source data into Shards. Implementation Details: 1. De-couple EntityProcessorBase from caching. - Created a new interface, DIHCache two implementations: - SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated). - BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar - NOTE: the existing Lucene Contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to Generic Usage. - NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html 2. Allow Entity Processors to take a cacheImpl parameter to cause the entity data to be cached (see EntityProcessorBase DIHCacheProperties). 3. Partially De-couple SolrWriter from DocBuilder - Created a new interface DIHWriter, two implementations: - SolrWriter (refactored) - DIHCacheWriter (allows DIH to write ultimately to a Cache). 4. Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache as DIH Entity Input. 5. Support a partition parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data. 6. 
Change the semantics of entity.destroy() - Previously, it was being called on each iteration of DocBuilder.buildDocument(). - Now it does one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor is completed. - The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change. General Notes: We are near completion in converting our search functionality from a legacy search engine to Solr. However, I found that DIH did not support caching to the level of our prior product's data import utility. In order to
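The pluggable-cache design described in the implementation details can be sketched roughly as follows. The interface and class names follow the patch description, but the signatures are assumptions for illustration, not the actual SOLR-2382 API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: entity rows are cached by key so a child entity's data can be
// joined to parents without re-querying the source (avoiding the n+1
// select problem described above). Signatures are illustrative.
interface DIHCache {
    void add(Object key, Map<String, Object> row);
    List<Map<String, Object>> lookup(Object key);
}

// In-memory analogue of the SortedMapBackedCache default implementation.
class SortedMapBackedCache implements DIHCache {
    private final SortedMap<Object, List<Map<String, Object>>> data = new TreeMap<>();

    public void add(Object key, Map<String, Object> row) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public List<Map<String, Object>> lookup(Object key) {
        return data.getOrDefault(key, List.of());
    }
}

public class DihCacheDemo {
    public static void main(String[] args) {
        DIHCache cache = new SortedMapBackedCache();
        cache.add("parentId-1", Map.of("field", "value"));
        System.out.println(cache.lookup("parentId-1").size()); // 1
        System.out.println(cache.lookup("missing").isEmpty()); // true
    }
}
```

A disk-backed implementation behind the same interface is what lets caches persist across DIH runs for delta imports.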
[JENKINS-MAVEN] Lucene-Solr-Maven-3.x #160: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-3.x/160/ No tests ran. Build Log (for compile errors): [...truncated 8855 lines...]
[jira] [Updated] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ https://issues.apache.org/jira/browse/LUCENE-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-695: -- Comment: was deleted (was: Read here http://customized-dog-collars.com) Improve BufferedIndexInput.readBytes() performance -- Key: LUCENE-695 URL: https://issues.apache.org/jira/browse/LUCENE-695 Project: Lucene - Java Issue Type: Improvement Components: core/store Affects Versions: 2.0.0 Reporter: Nadav Har'El Priority: Minor Attachments: readbytes.patch, readbytes.patch During a profiling session, I discovered that BufferedIndexInput.readBytes(), the function which reads a bunch of bytes from an index, is very inefficient in many cases. It is efficient for one or two bytes, and also efficient for a very large number of bytes (e.g., when the norms are read all at once); But for anything in between (e.g., 100 bytes), it is a performance disaster. It can easily be improved, though, and below I include a patch to do that. The basic problem in the existing code was that if you ask it to read 100 bytes, readBytes() simply calls readByte() 100 times in a loop, which means we check byte after byte if the buffer has another character, instead of just checking once how many bytes we have left, and copy them all at once. My version, attached below, copies these 100 bytes if they are available at bulk (using System.arraycopy), and if less than 100 are available, whatever is available gets copied, and then the rest. (as before, when a very large number of bytes is requested, it is read directly into the final buffer). In my profiling, this fix caused amazing performance improvement: previously, BufferedIndexInput.readBytes() took as much as 25% of the run time, and after the fix, this was down to 1% of the run time! However, my scenario is *not* the typical Lucene code, but rather a version of Lucene with added payloads, and these payloads average at 100 bytes, where the original readBytes() did worst. 
I expect that my fix will have less of an impact on vanilla Lucene, but it still can have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case). In addition to the change to readBytes(), my attached patch also adds a new unit test to BufferedIndexInput (which previously did not have a unit test). This test simulates a file which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and see that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well. By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened.
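The fix Nadav describes can be illustrated with a standalone analogue of the buffered read loop (field and method names are illustrative, not BufferedIndexInput's real internals): bulk-copy whatever the internal buffer holds via System.arraycopy instead of calling readByte() once per requested byte.

```java
// Sketch of the bulk-copy read: each pass copies as many bytes as the
// buffer can satisfy in one System.arraycopy, refilling only when dry.
public class BulkReadDemo {
    private final byte[] file;                  // stands in for the index file
    private final byte[] buffer = new byte[8];  // tiny buffer to force refills
    private int bufferStart = 0;                // file position of buffer[0]
    private int bufferPos = 0;                  // next unread byte in buffer
    private int bufferLen = 0;                  // valid bytes in buffer

    BulkReadDemo(byte[] file) { this.file = file; }

    private void refill() {
        bufferStart += bufferLen;
        bufferLen = Math.min(buffer.length, file.length - bufferStart);
        System.arraycopy(file, bufferStart, buffer, 0, bufferLen);
        bufferPos = 0;
    }

    void readBytes(byte[] dst, int offset, int len) {
        while (len > 0) {
            int available = bufferLen - bufferPos;
            if (available == 0) {
                refill();
                continue;
            }
            // Copy everything the buffer can satisfy at once, instead of a
            // byte-at-a-time readByte() call per byte.
            int chunk = Math.min(available, len);
            System.arraycopy(buffer, bufferPos, dst, offset, chunk);
            bufferPos += chunk;
            offset += chunk;
            len -= chunk;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        BulkReadDemo in = new BulkReadDemo(data);
        byte[] dst = new byte[20];
        in.readBytes(dst, 0, 20); // spans several buffer refills
        System.out.println(dst[0] + " " + dst[19]); // 0 19
    }
}
```

With mid-sized reads (e.g. 100 bytes against an 8 KB buffer), the per-byte bounds check in the old loop is exactly the overhead this removes.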
[jira] [Commented] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ https://issues.apache.org/jira/browse/LUCENE-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053807#comment-13053807 ] Uparis Abeysena commented on LUCENE-695: Click: http://customized-dog-collars.com
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053814#comment-13053814 ] Michael McCandless commented on LUCENE-3226: bq. What is the benefit of naming the constant according to what has changed? Because then devs trying to work w/ the code have some sense of what the change was? EG for debugging maybe it's helpful, eg if something has gone wrong, later, in how SegmentInfos is handling that version or what not. bq. And what if two changes occur in the same release? Well, we can handle that case by case? I agree it's messy... maybe pick a name describing/subsuming both? Or favor one name (maybe the bigger change) and use comments to explain the other change? But if there is a comment/comments above the constant containing this same information that's just as good... bq. These constants, IMO, are used only to detect code that is needed to support a certain version, and nothing more. Right, but for the devs that need to revisit such code, it's helpful to know what real change occurred within that version... else, during debugging they'd have to eg go do some svn archaeology to understand the change. bq. And since the purpose of LUCENE-2921 is to move all index format tracking to be at the 'code'-level and not 'feature'-level, I'd assume the constants would be named accordingly. True... so maybe we take this up under that issue? I would be OK with just having comments that describe what changed in each version... So for this issue maybe re-commit just the CheckIndex fix, and leave the constant naming fixes to LUCENE-2921? 
[jira] [Updated] (LUCENE-695) Improve BufferedIndexInput.readBytes() performance
[ https://issues.apache.org/jira/browse/LUCENE-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-695: -- Comment: was deleted (was: Click: http://customized-dog-collars.com)
I expect that my fix will have less of an impact on vanilla Lucene, but it still can have an impact because it is used for things like reading fields. (I am not aware of a standard Lucene benchmark, so I can't provide benchmarks on a more typical case.) In addition to the change to readBytes(), my attached patch also adds a new unit test to BufferedIndexInput (which previously did not have a unit test). This test simulates a file which contains a predictable series of bytes, and then tries to read from it with readByte() and readBytes() with various sizes (many thousands of combinations are tried) and checks that exactly the expected bytes are read. This test is independent of my new readBytes() implementation, and can be used to check the old implementation as well. By the way, it's interesting that BufferedIndexOutput.writeBytes was already efficient, and wasn't simply a loop of writeByte(). Only the reading code was inefficient. I wonder why this happened. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
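The bulk-copy idea described above can be sketched as follows. This is an illustrative reconstruction, not the attached patch: the class and field names (buffer, bufferPosition, bufferLength, refill) are assumptions standing in for BufferedIndexInput's internals.

```java
// A sketch of the bulk-copy idea described above (not the actual patch):
// instead of one readByte() call per requested byte, copy whatever the
// buffer currently holds with a single System.arraycopy, refill, and
// repeat until the request is satisfied.
abstract class SimpleBufferedInput {
    protected byte[] buffer = new byte[1024];
    protected int bufferPosition; // next byte to read from the buffer
    protected int bufferLength;   // number of valid bytes in the buffer

    /** Refill the buffer from the underlying file; resets bufferPosition. */
    protected abstract void refill();

    /** Read len bytes into b[offset..], bulk-copying from the buffer. */
    public void readBytes(byte[] b, int offset, int len) {
        while (len > 0) {
            int available = bufferLength - bufferPosition;
            if (available == 0) {   // buffer exhausted: refill once,
                refill();           // then recompute availability
                available = bufferLength - bufferPosition;
            }
            int chunk = Math.min(available, len);
            System.arraycopy(buffer, bufferPosition, b, offset, chunk);
            bufferPosition += chunk;
            offset += chunk;
            len -= chunk;
        }
    }
}
```

The unit test described in the comment follows the same shape: feed a predictable byte sequence through refill() and check that reads of arbitrary sizes return exactly the expected bytes.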
[jira] [Commented] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053817#comment-13053817 ] ludovic Boutros commented on LUCENE-3229: - :To reduce surprises like this one when nested spans are used, the ordered case might be changed to require no overlap at all. :To do that one could compare the end of one span with the beginning of the next one. :AFAIK none of the existing test cases uses a nested span query, so some more test cases for that would be good to have. The patch does exactly that. :The docSpansOrdered method in NearSpansUnordered from the SpanOverLap2.diff patch :is the same as the existing docSpansOrdered method in NearSpansOrdered. :That is probably not intended. It is the same as the actual method because I don't want to modify the current behavior of the NearSpansUnordered class. Overlap should be allowed for unordered near span queries. And if I do not do that, the unit tests fail for unordered near span queries. :Could you provide patches as described here: http://wiki.apache.org/lucene-java/HowToContribute ? Sorry for that, sure, I will provide the patch shortly. Overlapped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using Span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test): w1 w2 w3 w4 w5 If I try to search for this span query: spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned and I think it should not, because 'w4' is not after 'w5'. The 2 spans are not ordered, because there is an overlap. I will add a test patch in the TestNearSpansOrdered unit test. 
I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
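The two ordering checks under discussion can be sketched like this. Names and signatures are illustrative, not the actual NearSpansOrdered/NearSpansUnordered code; positions follow Lucene's convention that a span's end is exclusive.

```java
// Illustrative sketch of the two ordering checks discussed above.
class SpanOrderSketch {
    /** Lenient check: b merely starts (or, on a tie, ends) after a does,
     *  so an inner span may still overlap the next one -- the reported bug. */
    static boolean orderedAllowingOverlap(int aStart, int aEnd, int bStart, int bEnd) {
        return aStart == bStart ? aEnd < bEnd : aStart < bStart;
    }

    /** Strict check in the spirit of the patch: the next span must begin
     *  at or after the previous span ends, so no overlap is possible. */
    static boolean orderedNoOverlap(int aEnd, int bStart) {
        return aEnd <= bStart;
    }
}
```

For the w1..w5 example: spanNear(w3, w5) covers positions 2..5 and w4 covers 3..4, so the lenient check accepts the document while the strict check rejects it.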
[jira] [Updated] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ludovic Boutros updated LUCENE-3229: Attachment: LUCENE-3229.patch Here is the patch as described in the wiki. Is it OK? Overlapped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: LUCENE-3229.patch, SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using Span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test): w1 w2 w3 w4 w5 If I try to search for this span query: spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned and I think it should not, because 'w4' is not after 'w5'. The 2 spans are not ordered, because there is an overlap. I will add a test patch in the TestNearSpansOrdered unit test. I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053826#comment-13053826 ] Michael McCandless commented on LUCENE-3232: Patch looks great! I wonder if we should name this module something more specific, eg docvalues? values? Should we also move over ValueSource, *DocValues, FieldCacheSource? I think, then, Solr 3.x grouping could cutover and then group by other field types. Move MutableValues to Common Module --- Key: LUCENE-3232 URL: https://issues.apache.org/jira/browse/LUCENE-3232 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3232.patch, LUCENE-3232.patch Solr makes use of the MutableValue* series of classes to improve performance of grouping by FunctionQuery (I think). As such they are used in ValueSource implementations. Consequently we need to move these classes in order to move the ValueSources. As Yonik pointed out, these classes have use beyond just FunctionQuerys and might be used by both Solr and other modules. However I don't think they belong in Lucene core, since they aren't really related to search functionality. Therefore I think we should put them into a Common module, which can serve as a dependency to Solr and any module. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2618) Indexing and search on more than one object
Indexing and search on more than one object --- Key: SOLR-2618 URL: https://issues.apache.org/jira/browse/SOLR-2618 Project: Solr Issue Type: Improvement Components: clients - java Affects Versions: 3.2 Reporter: Monica Storfjord Priority: Minor It would be very beneficial for a project that I am currently working on to have the ability to index and search on various subclasses of an object and map the objects directly to the actual domain object. We are planning to do an implementation of this feature, but if there is a Solr plugin or something that introduces this feature already, it will reduce the development time for us greatly! We are using SolrJ against an Apache Solr 3.2 instance to index, change and search. It should be possible to make a solution that maps against a special type field (field name=classtype type=class) in schema.xml that is indexed every time and uses reflection against the actual class? - Monica -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053830#comment-13053830 ] Shai Erera commented on LUCENE-3226: bq. I would be OK with just having comments that describe what changed in each version... Yeah, that's what I thought. Constant name denotes the code version, documentation denotes the actual changes. bq. So for this issue maybe re-commit just the CheckIndex fix I think that that's what Robert and I agreed to do, and we moved on to discuss what the actual message printed should be, so it's less confusing to the users. rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex Key: LUCENE-3226 URL: https://issues.apache.org/jira/browse/LUCENE-3226 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1, 3.2 Reporter: Hoss Man Fix For: 3.3, 4.0 Attachments: LUCENE-3226.patch A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as... {noformat} Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1] {noformat} It seems like there are two very confusing pieces of information here... 1) the variable name of SegmentInfos.FORMAT_3_1 seems like a poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced. 2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it Lucene 3.1, which is misleading since that format is always used in 3.2 (and probably 3.3, etc...). I suggest: a) rename FORMAT_3_1 to something like FORMAT_SEGMENT_RECORDS_VERSION b) change CheckIndex so that the label for the newest format always ends with "and later" (ie: "Lucene 3.1 and later") so when we release versions w/o a format change we don't have to remember to manually list them in CheckIndex. 
when we *do* make format changes and update CheckIndex, "and later" can be replaced with "to X.Y" and the new format can be added -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053844#comment-13053844 ] Michael McCandless commented on LUCENE-3225: bq. Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus it's cluttering the seek method. One boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which only supports exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Well, it only means the enum is unpositioned if you get back NOT_FOUND? Ie, it's just like if you get back null from next(), or END from seek(): in these cases, the enum is unpositioned and you need to call seek again. My worry, if we force an up-front decision here (exact-only enum vs non-exact enum), is that we prevent legitimate use cases where the caller wants to mix & match with one enum. EG, when AutomatonQuery intersects w/ the terms: when it hits a region where terms are denser than what the automaton will accept (such as an infinite part), it should use non-exact seeking, but then when it's in a region where terms are less dense (eg a finite part) it should use exact seeking. I'll open a separate issue for this. The TermsEnum impls can be efficient in this case, ie re-using internal seek state for the exact and non-exact cases (MemoryCodec does this). But I agree another boolean to seek isn't great; maybe instead we can make a separate seekExact method? Default impl would just call seek (and get no perf gains). BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. 
But we can get substantial performance gains in some cases (see LUCENE-1536) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not. bq. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy? Not sure what you meant here? Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
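The seekExact idea floated in the comments can be sketched as follows. This is a hedged illustration, not the real Lucene TermsEnum API: the class name, String keys, and enum are stand-ins, but the shape matches the proposal, a separate method whose default implementation simply delegates to seek(), so only codecs that can save work (e.g. MemoryCodec) need to override it.

```java
// Sketch of the proposed seekExact default (illustrative names only).
abstract class SketchTermsEnum {
    enum SeekStatus { FOUND, NOT_FOUND, END }

    /** Positions the enum at term, or at the ceiling term if absent. */
    abstract SeekStatus seek(String term);

    /** Returns true only on an exact match; on false the enum may be
     *  left unpositioned, just like NOT_FOUND or END from seek().
     *  This default implementation gets no performance gain; codecs
     *  that can avoid computing the ceiling term would override it. */
    boolean seekExact(String term) {
        return seek(term) == SeekStatus.FOUND;
    }
}
```

Callers that only care about exact hits (e.g. deletes by Term, TermQuery per-segment lookups) would use seekExact; callers that need the ceiling term keep using seek.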
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053846#comment-13053846 ] Michael McCandless commented on LUCENE-3225: This patch gives nice gains for MemoryCodec: I did a quick test w/ my NRT stress test (reopen at 2X Twitter's peak indexing rate) and the reopen time dropped from ~49 msec to ~43 msec (~12% faster). This is impressive because resolving deletes is just one part of opening the NRT reader, ie we also must write the new segment, open SegmentReader against it, etc. Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053847#comment-13053847 ] Shai Erera commented on LUCENE-3079: I would like to contribute IBM's faceted search package (been wanting to do that for quite a while). The package supports the following features/capabilities (at a high level):
* Taxonomy index -- manages trees of 'categories'. You can view examples of taxonomies at e.g. the Open Directory Project.
** It's a Lucene index managed alongside the content index.
** Builds the taxonomy on-the-fly (i.e. as categories are discovered).
** In general it maps a category hierarchy to ordinals (integers). For example, the category /date/2011/06/24 will create the following entries in the taxonomy index:
*** /date, ordinal=1
*** /date/2011, ordinal=2
*** /date/2011/06, ordinal=3
*** /date/2011/06/24, ordinal=4
* FacetsDocumentBuilder, which receives a list of categories that are associated w/ the document (can be of several dimensions) and:
** Fetches the ordinals of the category components from the taxonomy index (adding them to it on-the-fly).
** Indexes them in a (compressed) payload for the document (so for the above category example, 4 payloads will be indexed for the document).
** FDB can be used to augment a Document with other fields for indexing (it adds its own Field objects).
* FacetsCollector receives a handle to the taxonomy and a list of facet 'roots' to count, and returns the top-K categories for each requested facet:
** The root can denote any node in the category tree (e.g., 'count all facets under /date/2011').
** Top-K can be returned for the topmost K immediate children of root, or any top-K in the sub-tree of root.
* Counting algorithm (at a high level):
** Fetch the payload for every matching document.
** Increment by 1 the count of every ordinal that was encountered (even for facets that were not requested by the user).
** After all ordinals are counted, compute the top-K on the ones the user requested.
** Label the result facets.
* Miscellaneous features:
** *Sampling* algorithm allows for more efficient facet counting/accumulation, while still returning exact counts for the top-K facets.
** *Complements* algorithm allows for more efficient facet counting/accumulation when the number of results is more than 50% of the docs in the index (we keep a total count of facets, count facets on the docs that did not match the query, and subtract).
*** Complements can be used to count facets that do not appear in any of the matching documents (of this result set). This does not exist in the package though ... yet.
** *Facets partitioning* -- if the taxonomy is huge (i.e. millions of categories), it is better to partition them at indexing time, so that search time is faster and consumes less memory. Note that this is required because of the approach of counting all (allocating a count array) and then keeping only the results of interest.
** *Category enhancements* allow storing 'metadata' with categories in the index, so that more than simple counting can be implemented:
*** *Weighted facets* (built on top of enhancements) allow associating a weight w/ each category, and using smarter counting techniques at runtime. For example, if facets are generated by an analytics component, the confidence level can be set as the category's weight. If tags are indexed as facets (for e.g. generating a tag cloud for the result set), the number of times the document was tagged by the tag can be set as the tag's weight.
** The fact that facets are indexed in the payloads of documents allows managing very large taxonomies and indexes without blowing up the RAM at runtime (but incurs some search performance overhead). However, the payloads can be loaded up into RAM (like in FieldCache), in which case runtime becomes much faster.
*** How the facets are stored is abstracted, though, by a CountingList API, so we can definitely explore other means of storing the facet ordinals. Actually, the CountingList API allows us to read the ordinals from disk or RAM w/o affecting the rest of the algorithm at all.
** I did not want to dive too deep on the API here, but the runtime API is very extensible and allows one to use FacetsCollector for the simple cases, or lower-level API to get more control over the process. You can look at: FacetRequest, FacetSearchParams, FacetResult, FacetResultNode, FacetsCollector, FacetsAccumulator, FacetsAggregator for a more extensive set of API to use.
* The package comes with example code which shows how to use the different features I've mentioned. There are also unit tests for ensuring the example code works :).
* The package comes with a very extensive test suite and is in use by many of our products for a long time, so I can state that it's very stable.
* Some rough performance numbers:
** Collection of 1M documents, few hierarchical
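The counting algorithm outlined above can be reduced to a small sketch. Plain arrays stand in for the taxonomy and for per-document payload decoding; all names here are illustrative (the real package decodes ordinals from compressed payloads and supports sampling, complements, and partitioning on top of this core).

```java
// Minimal sketch of the count-then-top-K core described above.
import java.util.Arrays;

class FacetCountSketch {
    /**
     * docOrdinals[d] = category ordinals attached to document d
     * (as decoded from its payload). Returns the k most frequent
     * ordinals across the matching documents.
     */
    static int[] topK(int[][] docOrdinals, int[] matchingDocs, int maxOrdinal, int k) {
        int[] counts = new int[maxOrdinal + 1];
        // Step 1: count every ordinal seen, even ones the user didn't request.
        for (int doc : matchingDocs)
            for (int ord : docOrdinals[doc])
                counts[ord]++;
        // Step 2: keep only the top-K ordinals by count (labeling them
        // via the taxonomy index would be the final step).
        Integer[] ords = new Integer[maxOrdinal + 1];
        for (int i = 0; i <= maxOrdinal; i++) ords[i] = i;
        Arrays.sort(ords, (a, b) -> counts[b] - counts[a]);
        int[] top = new int[Math.min(k, ords.length)];
        for (int i = 0; i < top.length; i++) top[i] = ords[i];
        return top;
    }
}
```

Note how the full count array over all ordinals is exactly why the comment says huge taxonomies need partitioning: the array is allocated even though only the requested top-K survives.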
[jira] [Commented] (LUCENE-3226) rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex
[ https://issues.apache.org/jira/browse/LUCENE-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053859#comment-13053859 ] Michael McCandless commented on LUCENE-3226: OK I agree, I think :) Who will re-commit the CheckIndex fix here...? rename SegmentInfos.FORMAT_3_1 and improve description in CheckIndex Key: LUCENE-3226 URL: https://issues.apache.org/jira/browse/LUCENE-3226 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.1, 3.2 Reporter: Hoss Man Fix For: 3.3, 4.0 Attachments: LUCENE-3226.patch A 3.2 user recently asked if something was wrong because CheckIndex was reporting his (newly built) index version as... {noformat} Segments file=segments_or numSegments=1 version=FORMAT_3_1 [Lucene 3.1] {noformat} It seems like there are two very confusing pieces of information here... 1) the variable name of SegmentInfos.FORMAT_3_1 seems like a poor choice. All other FORMAT_* constants in SegmentInfos are descriptive of the actual change made, and not specific to the version when they were introduced. 2) whatever the name of the FORMAT_* variable, CheckIndex is labeling it Lucene 3.1, which is misleading since that format is always used in 3.2 (and probably 3.3, etc...). I suggest: a) rename FORMAT_3_1 to something like FORMAT_SEGMENT_RECORDS_VERSION b) change CheckIndex so that the label for the newest format always ends with "and later" (ie: "Lucene 3.1 and later") so when we release versions w/o a format change we don't have to remember to manually list them in CheckIndex. when we *do* make format changes and update CheckIndex, "and later" can be replaced with "to X.Y" and the new format can be added -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3231) Add fixed size DocValues int variants & expose Arrays where possible
[ https://issues.apache.org/jira/browse/LUCENE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053865#comment-13053865 ] Michael McCandless commented on LUCENE-3231: This looks great Simon! Add fixed size DocValues int variants & expose Arrays where possible Key: LUCENE-3231 URL: https://issues.apache.org/jira/browse/LUCENE-3231 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-3231.patch Currently we only have a variable bit-packed ints implementation. For flexible scoring or loading field caches it is desirable to have fixed int implementations for 8, 16, 32 and 64 bit. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3233) HuperDuperSynonymsFilter™
HuperDuperSynonymsFilter™ - Key: LUCENE-3233 URL: https://issues.apache.org/jira/browse/LUCENE-3233 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir The current synonymsfilter uses a lot of ram and cpu, especially at build time. I think yesterday I heard about huge synonyms files three times. So, I think we should use an FST-based structure, sharing the inputs and outputs. And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes() -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3233) HuperDuperSynonymsFilter™
[ https://issues.apache.org/jira/browse/LUCENE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-3233: Attachment: LUCENE-3233.patch here's a rough start to building a datastructure that I think makes good tradeoffs between RAM and processing. No matter what, the processing on the filter-side will be hairy because of the 'interleaving' with the tokenstream. This one is just an FST<CharsRef, Int[]> (BYTE4) where Int is an ord to a BytesRefHash, containing the output Bytes for each term. This way, at input time we can walk the FST with codePointAt() On both sides, the Chars/Bytes are actually phrases, using \u as a word separator. HuperDuperSynonymsFilter™ - Key: LUCENE-3233 URL: https://issues.apache.org/jira/browse/LUCENE-3233 Project: Lucene - Java Issue Type: Improvement Reporter: Robert Muir Attachments: LUCENE-3233.patch The current synonymsfilter uses a lot of ram and cpu, especially at build time. I think yesterday I heard about huge synonyms files three times. So, I think we should use an FST-based structure, sharing the inputs and outputs. And we should be more efficient with the tokenStream api, e.g. using save/restoreState instead of cloneAttributes() -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
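The shape of that datastructure can be illustrated with ordinary collections standing in for the real pieces. Everything here is an assumption for illustration: a TreeMap plays the FST's role, a List plus a Map plays the BytesRefHash's role, and a NUL-style separator character stands in for the word separator mentioned above.

```java
// Illustrative stand-in for the structure described above: multi-word
// input phrases are joined with a separator character, and each maps to
// an ordinal into a deduplicated table of output phrases, so identical
// outputs are shared across many inputs (the FST's payload-sharing idea).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SynonymMapSketch {
    private static final String SEP = "\u0000";                      // assumed word separator
    private final TreeMap<String, Integer> inputs = new TreeMap<>(); // phrase -> ord
    private final List<String> outputs = new ArrayList<>();          // ord -> output phrase
    private final Map<String, Integer> outputOrds = new HashMap<>(); // dedup outputs

    void add(String[] inputWords, String[] outputWords) {
        String out = String.join(SEP, outputWords);
        Integer ord = outputOrds.get(out);
        if (ord == null) {                 // share identical outputs via one ord
            ord = outputs.size();
            outputs.add(out);
            outputOrds.put(out, ord);
        }
        inputs.put(String.join(SEP, inputWords), ord);
    }

    /** Returns the output phrase for an input phrase, or null if absent. */
    String[] lookup(String... words) {
        Integer ord = inputs.get(String.join(SEP, words));
        return ord == null ? null : outputs.get(ord).split(SEP);
    }
}
```

The RAM win in the real design comes from the FST sharing prefixes/suffixes of inputs and the ord table sharing outputs; this sketch only shows the second half.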
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053870#comment-13053870 ] Mark Miller commented on SOLR-2610: --- But you *might* want to (in fact, I do this). If you are really done with a core, if you *really* want to remove it, what do you need the config files around for anymore? Seems like a reasonable option to me - makes no sense as the default I'd agree with. nukeEverything=true ;) Add an option to delete index through CoreAdmin UNLOAD action - Key: SOLR-2610 URL: https://issues.apache.org/jira/browse/SOLR-2610 Project: Solr Issue Type: Improvement Components: multicore Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2610-branch3x.patch, SOLR-2610.patch Right now, one can unload a Solr Core but the index files are left behind and consume disk space. We should have an option to delete the index when unloading a core. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3230) Make FSDirectory.fsync() public and static
[ https://issues.apache.org/jira/browse/LUCENE-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053872#comment-13053872 ] Michael McCandless commented on LUCENE-3230: This seems OK, but my only worry is I'm not sure this way of fsync'ing really works? Ie, this code opens a r/w RAF, calls sync, closes it. It's not clear that this is guaranteed to sync file handles open in the past against the same file. This is something we separately should look into / fix, but with this uncertainty it makes me nervous exposing this as a public API... maybe we could expose it with a big warning. bq. Also, while reviewing the code, I noticed that if IOE occurs, the code sleeps for 5 msec. If an InterruptedException occurs then, it immediately throws ThreadIE, completely ignoring the fact that it slept due to IOE. Shouldn't we at least pass IOE.getMessage() on ThreadIE? +1 Make FSDirectory.fsync() public and static -- Key: LUCENE-3230 URL: https://issues.apache.org/jira/browse/LUCENE-3230 Project: Lucene - Java Issue Type: New Feature Components: core/store Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.3, 4.0 I find FSDirectory.fsync() (today protected and instance method) very useful as a utility to sync() files. I'd like create a FSDirectory.sync() utility which contains the exact same impl of FSDir.fsync(), and have the latter call it. We can have it part of IOUtils too, as it's a completely standalone utility. I would get rid of FSDir.fsync() if it wasn't protected (as if encouraging people to override it). I doubt anyone really overrides it (our core Directories don't). Also, while reviewing the code, I noticed that if IOE occurs, the code sleeps for 5 msec. If an InterruptedException occurs then, it immediately throws ThreadIE, completely ignoring the fact that it slept due to IOE. Shouldn't we at least pass IOE.getMessage() on ThreadIE? The patch is trivial, so I'd like to get some feedback before I post it. 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
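The fsync pattern under discussion can be shown as a standalone sketch: open a read/write handle on an existing file and sync its descriptor, retrying briefly on IOException. This is illustrative, not the FSDirectory code; the retry count and sleep mirror the 5 msec back-off the comments describe, and carrying the IOException's message into the interrupt path follows the suggestion above.

```java
// Standalone sketch of the fsync utility being discussed (assumed shape).
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class FsyncSketch {
    static void fsync(File file) throws IOException {
        IOException lastIOE = null;
        for (int retry = 0; retry < 5; retry++) {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.getFD().sync();   // ask the OS to flush this descriptor
                return;
            } catch (IOException ioe) {
                lastIOE = ioe;
                try {
                    Thread.sleep(5);  // brief back-off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    // don't swallow the IOE that caused the sleep:
                    throw new RuntimeException(
                        "fsync interrupted after: " + lastIOE.getMessage(), ie);
                }
            }
        }
        throw lastIOE; // all retries failed
    }
}
```

Note this also makes the comment's worry concrete: the sync applies to the freshly opened descriptor, so whether it covers writes made through earlier handles on the same file is exactly the open question.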
[jira] [Commented] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053876#comment-13053876 ] Simon Willnauer commented on LUCENE-3232: - bq. I wonder if we should name this module something more specific, eg docvalues? values? dude! no! :) Move MutableValues to Common Module --- Key: LUCENE-3232 URL: https://issues.apache.org/jira/browse/LUCENE-3232 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3232.patch, LUCENE-3232.patch Solr makes use of the MutableValue* series of classes to improve performance of grouping by FunctionQuery (I think). As such they are used in ValueSource implementations. Consequently we need to move these classes in order to move the ValueSources. As Yonik pointed out, these classes have use beyond just FunctionQuerys and might be used by both Solr and other modules. However I don't think they belong in Lucene core, since they aren't really related to search functionality. Therefore I think we should put them into a Common module, which can serve as a dependency to Solr and any module. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053885#comment-13053885 ] Simon Willnauer commented on LUCENE-3225: - {quote} BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. But we can get substantial performance gains in some cases (see LUCENE-1536) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not. {quote} +1!! {quote} But I agree another boolean to seek isn't great; maybe instead we can make a separate seekExact method? Default impl would just call seek (and get no perf gains). {quote} That's another option and I like that better, though. Yet the other should then be seekFloor, no? bq. not sure what you meant here? Never mind, I only looked at the top of the patch and figured that we only save the loading into the BytesRef, but there is more to it... Optimize TermsEnum.seek when caller doesn't need next term -- Key: LUCENE-3225 URL: https://issues.apache.org/jira/browse/LUCENE-3225 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-3225.patch Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term. We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment. I had done this change as part of LUCENE-3030, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today. The change adds a boolean onlyExact param to seek(BytesRef). 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053886#comment-13053886 ] Jan Høydahl commented on LUCENE-3079: - Bravo Shai IBM! Faceting module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (e.g. Bobo Browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed to rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has a multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... (and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of knowledge Solr has because it owns the index, that no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs, and so we really have to address them in creating the shared module. I think we should get a basic faceting module started, but should not cut Solr over at first.
We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (i.e., lose functionality or performance), we can cut over later.
[JENKINS-MAVEN] Lucene-Solr-Maven-trunk #157: POMs out of sync
Build: https://builds.apache.org/job/Lucene-Solr-Maven-trunk/157/ No tests ran. Build Log (for compile errors): [...truncated 7493 lines...]
[jira] [Commented] (LUCENE-3079) Faceting module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053940#comment-13053940 ] Ryan McKinley commented on LUCENE-3079: --- bq. Bravo Shai IBM! +1! This sounds awesome, and I hope it will prove how modules will help Lucene *and* Solr.
[jira] [Commented] (LUCENE-3225) Optimize TermsEnum.seek when caller doesn't need next term
[ https://issues.apache.org/jira/browse/LUCENE-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053964#comment-13053964 ] Michael McCandless commented on LUCENE-3225: bq. But shouldn't the other method then be seekFloor? Ahhh, right, we had discussed this on the dev list. I agree! But we should do that in another issue. Though, I think we should rename the current seek to seekCeil; I'll do that here.
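The seekCeil/seekExact split under discussion can be sketched roughly as follows. This is a toy model over a sorted term list, not Lucene's actual TermsEnum; the method names mirror the proposal but are only illustrative:

```java
import java.util.Collections;
import java.util.List;

// Toy model of the proposed API split: seekCeil positions the enum on the
// smallest term >= target, while seekExact only reports whether the target
// exists. The default seekExact just delegates to seekCeil (no speedup);
// a codec with a cheaper exact lookup (e.g. an FST-based one) could override it.
class ToyTermsEnum {
    private final List<String> sortedTerms;
    private int position = -1;

    ToyTermsEnum(List<String> sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    /** Seeks to the smallest term >= target; returns true on an exact match. */
    boolean seekCeil(String target) {
        int idx = Collections.binarySearch(sortedTerms, target);
        if (idx >= 0) {
            position = idx;
            return true;
        }
        position = -idx - 1; // binarySearch encodes the ceiling as -(insertion point) - 1
        return false;
    }

    /** Default impl: same cost as seekCeil; subclasses may override to save work. */
    boolean seekExact(String target) {
        return seekCeil(target);
    }

    /** The term the enum is positioned on, or null if positioned past the end. */
    String term() {
        return position >= 0 && position < sortedTerms.size() ? sortedTerms.get(position) : null;
    }
}
```

A caller like IndexWriter's delete-by-Term path would call seekExact and never look at term(), which is exactly the case where an overriding codec can skip computing the ceiling term.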
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053992#comment-13053992 ] Jason Rutherglen commented on SOLR-2610: Mark put it aptly. The problem I encountered in my own version is that leftover file handles seemed to prevent deletion of all the files; many times some of them would be left over. I also deleted the entire core directory, which is useful for manual testing (e.g., to avoid the directory-exists exception). Add an option to delete index through CoreAdmin UNLOAD action - Key: SOLR-2610 URL: https://issues.apache.org/jira/browse/SOLR-2610 Project: Solr Issue Type: Improvement Components: multicore Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2610-branch3x.patch, SOLR-2610.patch Right now, one can unload a Solr core, but the index files are left behind and consume disk space. We should have an option to delete the index when unloading a core.
[jira] [Created] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artificial test case that shows a 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.
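The phraseLimit behavior described above amounts to a bail-out counter in the candidate-phrase loop. A rough, self-contained sketch of the idea (the class and method names here are hypothetical, not FVH's internals):

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the phraseLimit idea: stop examining candidate phrase
// matches once the limit is reached, trading best-snippet accuracy for
// bounded work. A limit of -1 keeps the exhaustive (existing) behavior.
class PhraseLimitSketch {
    static List<Integer> collectCandidates(int[] matchPositions, int phraseLimit) {
        List<Integer> candidates = new ArrayList<>();
        for (int pos : matchPositions) {
            if (phraseLimit >= 0 && candidates.size() >= phraseLimit) {
                break; // bail out: accept a possibly sub-optimal snippet
            }
            candidates.add(pos); // the real highlighter would score a phrase here
        }
        return candidates;
    }
}
```

With phraseLimit=1 only the first candidate survives, which matches the "just show the first snippet" use case; larger limits mainly cap pathological blow-up.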
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Sokolov updated LUCENE-3234: - Attachment: LUCENE-3234.patch
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054007#comment-13054007 ] Robert Muir commented on LUCENE-3234: - I like this tradeoff Mike, thanks! Should we consider setting some kind of absurd default, like 10,000, to really prevent pathological cases with huge documents? We could document in CHANGES.txt that if you want the old behavior, set it to -1 or Integer.MAX_VALUE (I think we can use that here? offsets are ints?).
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054014#comment-13054014 ] Robert Muir commented on LUCENE-3234: - Yeah, you are right... but seeing as positions are ints too, I think it might be easier to use Integer.MAX_VALUE versus the -1 parameter.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054010#comment-13054010 ] Mike Sokolov commented on LUCENE-3234: -- Yes, although a smaller number might be fine. Maybe Koji will comment: I don't completely understand the scaling here, but it seemed to me that I had a case with around 2000 occurrences of a term that led to a 15-20 sec evaluation time on my desktop. The max value will be an int, sure, although I think the number is going to scale like positions, not offsets, FWIW.
[jira] [Commented] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054026#comment-13054026 ] Paul Elschot commented on LUCENE-3229: -- Thanks for bringing this up; this has confused more people in the past, and that could well be over now. Overlapped SpanNearQuery --- Key: LUCENE-3229 URL: https://issues.apache.org/jira/browse/LUCENE-3229 Project: Lucene - Java Issue Type: Bug Components: core/search Affects Versions: 3.1 Environment: Windows XP, Java 1.6 Reporter: ludovic Boutros Priority: Minor Attachments: LUCENE-3229.patch, LUCENE-3229.patch, SpanOverlap.diff, SpanOverlap2.diff, SpanOverlapTestUnit.diff While using span queries I think I've found a little bug. With a document like this (from the TestNearSpansOrdered unit test): w1 w2 w3 w4 w5 If I try to search with this span query: spanNear([spanNear([field:w3, field:w5], 1, true), field:w4], 0, true) the above document is returned, and I think it should not be, because 'w4' is not after 'w5'. The two spans are not ordered, because there is an overlap. I will add a test patch to the TestNearSpansOrdered unit test. I will add a patch to solve this issue too. Basically it modifies the two docSpansOrdered functions to make sure that the spans do not overlap.
[jira] [Updated] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Elschot updated LUCENE-3229: - Attachment: LUCENE-3229.patch Basically the same functionality as the previous patch by Ludovic Boutros. Simplified the check for non-overlapping spans, which might speed it up somewhat. Added javadoc explanations on ordered without overlap and unordered with overlap. Minor spelling and indentation changes. NearSpansOrdered might be further simplified, as not all locals are actually used now because of the simplified check, but for now I prefer to leave that to the JIT to optimize away.
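The simplified ordered-without-overlap condition can be stated compactly: adjacent spans are in order iff each span ends no later than the next one starts. A minimal stand-alone version of the check, operating on {start, end} position pairs rather than Lucene's Spans objects:

```java
// Minimal version of the ordered-without-overlap check: spans are in strict
// document order iff each span ends at or before the next span's start.
// Spans are {start, end} position pairs with an exclusive end.
class SpanOrderCheck {
    static boolean docSpansOrdered(int[][] spans) {
        for (int i = 0; i + 1 < spans.length; i++) {
            if (spans[i][1] > spans[i + 1][0]) {
                return false; // overlap, or out of order
            }
        }
        return true;
    }
}
```

In the w1..w5 example, the inner spanNear([w3, w5]) covers positions 2..5 while w4 covers 3..4, so the pair overlaps and the document is rejected under this check.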
[jira] [Commented] (LUCENE-3229) Overlapped SpanNearQuery
[ https://issues.apache.org/jira/browse/LUCENE-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054045#comment-13054045 ] ludovic Boutros commented on LUCENE-3229: - Thanks Paul, do you have any idea when this patch will be applied to branch 3.x?
[jira] [Created] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug Key: LUCENE-3235 URL: https://issues.apache.org/jira/browse/LUCENE-3235 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Not sure what's going on yet... but under Java 1.6 it seems not to hang, while under Java 1.5 it hangs fairly easily, on Linux. Java is 1.5.0_22. I suspect this is relevant: http://stackoverflow.com/questions/3292577/is-it-possible-for-concurrenthashmap-to-deadlock which refers to this JVM bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6865591 which then refers to this one http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6822370 It looks like that last bug was fixed in Java 1.6 but not 1.5.
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054066#comment-13054066 ] Robert Muir commented on LUCENE-3235: - +1 to drop Java 5
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054067#comment-13054067 ] Uwe Schindler commented on LUCENE-3235: --- LOL, no comment.
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054070#comment-13054070 ] Robert Muir commented on LUCENE-3235: - I ran the test with the same version as Mike (1.5.0_22) in two ways on Windows: * -Dtests.iter=100 * in a loop from a script, 100 times, each with its own ant run. I can't reproduce it on Windows. In my eyes, there isn't even an argument about whether or not we should support Java 5: it's not possible if bugs are not getting fixed.
[jira] [Commented] (LUCENE-3235) TestDoubleBarrelLRUCache hangs under Java 1.5, 3.x and trunk, likely JVM bug
[ https://issues.apache.org/jira/browse/LUCENE-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054071#comment-13054071 ] Michael McCandless commented on LUCENE-3235: Still hangs if I run -client; but it looks like -Xint prevents the hang (235 iterations so far on beast). 3.2 also hangs.
[VOTE] release 3.3
Artifacts here: http://s.apache.org/lusolr33rc0 Working release notes here: http://wiki.apache.org/lucene-java/ReleaseNote33 http://wiki.apache.org/solr/ReleaseNote33 I ran the automated release test script in trunk/dev-tools/scripts/smokeTestRelease.py, and ran 'ant test' at the top level 50 times on Windows. Here is my +1
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054079#comment-13054079 ] Shawn Heisey commented on SOLR-2610: I can think of a corollary core action I'd like to see: the ability, on a core RELOAD, to entirely delete the index from a core and replace it with a fresh empty index that will start building at segment _0. I would do this to my build core before using it, and later, after swapping it with the live core and ensuring it's good, to free up disk space.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054093#comment-13054093 ] Mike Sokolov commented on LUCENE-3234: -- Yes, that makes sense to me: default to 5000, say, and set explicitly to either MAX_VALUE or -1 to get the unlimited behavior (I prefer to allow -1, since otherwise you should probably treat it as an error). Do you want me to change the patch, or should I just leave that to the committer?
[jira] [Created] (LUCENE-3236) Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter
Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter -- Key: LUCENE-3236 URL: https://issues.apache.org/jira/browse/LUCENE-3236 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 4.0 Environment: N/A Reporter: Sujit Pal Priority: Minor Fix For: 4.0

PorterStemFilter has functionality to detect if a term has been marked as a keyword by the KeywordMarkerFilter (KeywordAttribute.isKeyword() == true), and if so, skip stemming. The suggestion is to have the same functionality in other filters where it is applicable. I think it may be particularly applicable to the LowerCaseFilter (ie if it is a keyword, don't mess with the case), and StopFilter (if it is a keyword, then don't filter it out even if it looks like a stop word). Backward compatibility is maintained (in both cases) by adding a new constructor which takes an additional boolean parameter ignoreKeyword. The current constructor will call this new constructor with ignoreKeyword = false. Patches are attached (for LowerCaseFilter and StopFilter). I have verified that the analysis JUnit tests run against the updated code, ie, backward compatibility is maintained.
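The proposed keyword-aware behavior can be illustrated with a toy sketch (this is not the Lucene TokenFilter API; the boolean flags stand in for KeywordAttribute.isKeyword() and the suggested ignoreKeyword constructor parameter):

```java
// Toy model of the proposed LowerCaseFilter change: lowercase a term only
// when it is not marked as a keyword, or when keyword-awareness is disabled.
import java.util.Locale;

class KeywordAwareLowerCase {
    static String filter(String term, boolean isKeyword, boolean ignoreKeyword) {
        if (ignoreKeyword && isKeyword) {
            return term; // keyword-marked terms pass through untouched
        }
        return term.toLowerCase(Locale.ROOT);
    }
}
```

With ignoreKeyword=false every term is lowercased as before, which is how the patch preserves backward compatibility; the StopFilter change follows the same pattern (a keyword-marked term is never dropped as a stop word).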
[jira] [Updated] (LUCENE-3236) Make LowerCaseFilter and StopFilter keyword aware, similar to PorterStemFilter
[ https://issues.apache.org/jira/browse/LUCENE-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujit Pal updated LUCENE-3236: -- Attachment: lucene-3236-patch.diff Patch generated with svn diff from the top level Lucene/Solr trunk. Contains updates to LowerCaseFilter and StopFilter to recognize and NOT operate on terms marked with KeywordAttribute.isKeyword. (NOTE: also contains changes to changes2html.pl which seem to have been generated automatically).
[jira] [Commented] (SOLR-2382) DIH Cache Improvements
[ https://issues.apache.org/jira/browse/SOLR-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054104#comment-13054104 ] James Dyer commented on SOLR-2382: --

{quote} The DIHCache should take the Context as a param and the EntityProcessor does not need to make a copy of the attributes {quote}

I started down this road this afternoon, hoping to have another patch version for you to look at today. But it turns out to be more complicated than I first anticipated. Any ideas how to get around these difficulties?

- DIHCacheProcessor enforces readOnly=false and deletePriorData=true. It also modifies the cacheName if the user specifies partitions. Seeing that Context-Entity-Attributes are immutable, should I pass these as Entity-Scope-Session-Attributes?
- cacheInit() in EntityProcessorBase specifically passes only the parameters that apply to the current situation. This way, if a user applies something non-applicable, it is safely ignored rather than producing undefined behavior. Just forwarding the context on doesn't give this flexibility. Do you think it's ok to just forward on the context anyway?
- DocBuilder instantiates DIHCacheWriter, which in turn gets the user-specified cache implementation and instantiates that. At this point in time, there doesn't seem to be a Context to pass. So, rather than doing this in the constructor, is there a safer place down the road where I should be instantiating the DIHCacheWriter?

I realize that it's more lines of code to always copy these properties into a property map to send to the cache, but I was looking at the cache as being a layer down in the stack, and maybe it shouldn't have the whole context sent to it. What do you think?
DIH Cache Improvements -- Key: SOLR-2382 URL: https://issues.apache.org/jira/browse/SOLR-2382 Project: Solr Issue Type: New Feature Components: contrib - DataImportHandler Reporter: James Dyer Priority: Minor Attachments: SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch, SOLR-2382.patch

Functionality:
1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.
2. Provide a means to temporarily cache a child entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor).
3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an entity input. Also provide the ability to do delta updates on such persistent caches.
4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr shards, or to the same core in parallel.

Use Cases:
1. We needed a flexible, scalable way to temporarily cache child-entity data prior to joining to parent entities.
- Using SqlEntityProcessor with child entities can cause an n+1 select problem.
- CachedSqlEntityProcessor only supports an in-memory HashMap as a caching mechanism and does not scale.
- There is no way to cache non-SQL inputs (ex: flat files, xml, etc).
2. We needed the ability to gather data from long-running entities by a process that runs separate from our main indexing process.
3. We wanted the ability to do a delta import of only the entities that changed.
- Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
- Our data comes from 50+ complex sql queries and/or flat files.
- We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
- Persistent DIH caches solve this problem.
4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the threads parameter).
5. In the future, we may need to use shards, creating a need to easily partition our source data into shards.

Implementation Details:
1. De-couple EntityProcessorBase from caching.
- Created a new interface, DIHCache, with two implementations:
- SortedMapBackedCache - an in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
- BerkleyBackedCache - a disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
- NOTE: the existing Lucene contrib db project uses je-3.3.93.jar. I believe this may be incompatible due to generics usage.
- NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html
2. Allow Entity Processors to take a
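The pluggable cache described above might have roughly the following shape (a sketch for illustration; the method names are assumptions, not the patch's actual DIHCache signatures):

```java
// Hypothetical minimal cache contract for DIH-style child-entity caching.
// SortedMapCache mirrors the in-memory SortedMapBackedCache idea; a
// disk-backed implementation such as BerkleyBackedCache would implement
// the same interface, which is what makes the framework pluggable.
import java.util.*;

interface SimpleDIHCache {
    void add(String key, Map<String, Object> row);
    List<Map<String, Object>> lookup(String key);
    void close();
}

class SortedMapCache implements SimpleDIHCache {
    private final SortedMap<String, List<Map<String, Object>>> data = new TreeMap<>();

    public void add(String key, Map<String, Object> row) {
        // Multiple child rows may share one join key, so each key maps to a list.
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
    }

    public List<Map<String, Object>> lookup(String key) {
        return data.getOrDefault(key, Collections.emptyList());
    }

    public void close() {
        data.clear();
    }
}
```

Caching child rows keyed by the join value is what avoids the n+1 select problem: the child query runs once, and each parent row does an in-memory (or on-disk) lookup instead of a new SQL query.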
[jira] [Commented] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054110#comment-13054110 ] Chris Male commented on LUCENE-3232:

{quote} I wonder if we should name this module something more specific, eg docvalues? values? Should we also move over ValueSource, *DocValues, FieldCacheSource? I think, then, Solr 3.x grouping could cutover and then group by other field types. {quote}

To be honest, that wasn't my plan :D My plan is to first move these to a Common module which will serve basically as a utility module for other modules. The MutableValue classes are useful in a number of places (or will be in the future). I envisage other useful utility-like classes going into this module in the future too. Solr, for example, has a number of very useful utilities that might be of benefit. As such, it doesn't really relate to FunctionQuerys or ValueSources. The next step once this is complete is to do what I originally intended: make a Queries module and push FunctionQuery and all the ValueSources / DocValues into that. In the end you get the following structure:

{code}
modules/
  common/  (MutableValue*)
  queries/ (FunctionQuery, *DocValues, *ValueSource, Queries from contrib/queries)
{code}

Seem reasonable?

Move MutableValues to Common Module --- Key: LUCENE-3232 URL: https://issues.apache.org/jira/browse/LUCENE-3232 Project: Lucene - Java Issue Type: Sub-task Components: core/search Reporter: Chris Male Fix For: 4.0 Attachments: LUCENE-3232.patch, LUCENE-3232.patch

Solr makes use of the MutableValue* series of classes to improve performance of grouping by FunctionQuery (I think). As such they are used in ValueSource implementations. Consequently we need to move these classes in order to move the ValueSources. As Yonik pointed out, these classes have use beyond just FunctionQuerys and might be used by both Solr and other modules. However I don't think they belong in Lucene core, since they aren't really related to search functionality. Therefore I think we should put them into a Common module, which can serve as a dependency to Solr and any module.
[jira] [Issue Comment Edited] (LUCENE-3232) Move MutableValues to Common Module
[ https://issues.apache.org/jira/browse/LUCENE-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054110#comment-13054110 ] Chris Male edited comment on LUCENE-3232 at 6/23/11 9:28 PM: - {quote} I wonder if we should name this module something more specific, eg docvalues? values? Should we also move over ValueSource, *DocValues, FieldCacheSource? I think, then, Solr 3.x grouping could cutover and then group by other field types. {quote} To be honest, that wasn't my plan :D My plan is to first move these to a Common module which will serve basically as a utility module for other modules. The MutableValue classes are useful in a number of places (or will be in the future). I envisage other useful utility-like classes going into this module in the future too. Solr for example has a number of very useful utilities that might be of benefit. As such, it doesn't really relate to FunctionQuerys or ValueSources. The next step once this is complete is to do what I originally intended and make a Queries module and push FunctionQuery and all the ValueSources / DocValues into that. In the end you get the following structure: {code} modules/ common/ (MutableValue*) queries/ (FunctionQuery, *DocValues, *ValueSource, Queries from contrib/queries) {code} Seem reasonable?
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054114#comment-13054114 ] Robert Muir commented on LUCENE-3234: - You can change it if you don't mind. However, I think I agree it would be good to figure out if there is an n^2 here. This might have some effect on what the default value should be... ideally there is some way we could fix the n^2. Is there a way to turn your test case into a benchmark, or do you have a separate benchmark (the example you mentioned where it blows up really badly)? This could help in looking at what's going on.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054125#comment-13054125 ] Mike Sokolov commented on LUCENE-3234: -- I don't think I can share the test documents I have - they belong to someone else. I can look at trying to make something bad happen with the wikipedia data, but I'm curious why a benchmark is preferable to a test case?
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054133#comment-13054133 ] Robert Muir commented on LUCENE-3234: - Oh that's ok, I just meant a little tiny benchmark hitting the nasty case that we think might be n^2. If the little test case does that... then that will work; I just wasn't sure if it did. Either way, just something to look at in the profiler, etc.
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054162#comment-13054162 ] Mike Sokolov commented on LUCENE-3234: -- I did go back and look at the original case that made me worried; in that case the bad document is 650K, and the matched term occurs 23000 times in it. The search still finishes in 24 sec or so on my desktop, which isn't too bad I guess, considering. After looking at that and measuring the change in the test case in the patch as the number of terms increases, I don't think there actually is an n^2 - just linear, but the growth is still enough that the patch has value. The test case in the patch is closely targeted at the method which takes all the time when you have large numbers of matching terms in a single document.
Re: [VOTE] release 3.3
+1 Thanks for pulling the release together Robert. On Fri, Jun 24, 2011 at 9:33 AM, Michael McCandless luc...@mikemccandless.com wrote: +1 Smoke testing passed for me, except for the Java 1.5-only hang in TestDoubleBarrelLRUCache (LUCENE-3235), but I don't think that should block the release. Mike McCandless http://blog.mikemccandless.com On Thu, Jun 23, 2011 at 4:18 PM, Robert Muir rcm...@gmail.com wrote: Artifacts here: http://s.apache.org/lusolr33rc0 Working release notes here: http://wiki.apache.org/lucene-java/ReleaseNote33 http://wiki.apache.org/solr/ReleaseNote33 I ran the automated release test script in trunk/dev-tools/scripts/smokeTestRelease.py, and ran 'ant test' at the top level 50 times on windows. Here is my +1 -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054193#comment-13054193 ] Koji Sekiguchi commented on LUCENE-3234: Mike, thank you for your continued interest in FVH! Can you add the parameter for Solr, with an appropriate default value, if you would like? I don't know whether the assertTrue test in testManyRepeatedTerms() is ok for Jenkins?
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi updated LUCENE-3234: --- Affects Version/s: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Fix Version/s: 3.4, 4.0
[jira] [Assigned] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned LUCENE-3234: -- Assignee: Koji Sekiguchi
[jira] [Updated] (SOLR-2429) ability to not cache a filter
[ https://issues.apache.org/jira/browse/SOLR-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-2429:
-------------------------------
    Attachment: SOLR-2429.patch

Here's a patch that allows one to add cache=false to top-level queries (main queries, filter queries, facet queries, etc).

Currently (without this patch) Solr generates the set of documents that match each filter individually, so that they can be cached and reused. Adding cache=false to the main query prevents lookup/storing in the query cache. Adding cache=false to any filter query causes the filterCache not to be used. Further, such a filter query is actually run in parallel with the main query and any other non-cached filter queries (which can speed things up if the base query or the other filter queries are relatively sparse).

There is also an optional cost parameter that controls the order in which non-cached filter queries are evaluated, so knowledgeable users can order less expensive non-cached filters before expensive ones.

As an additional feature for very high-cost filters: if cache=false and cost=100 and the query implements the PostFilter interface, a Collector will be requested from that query and used to filter documents after they have matched the main query and all other filter queries. There can be multiple post filters, and they are also ordered by cost.

The frange query (a range over function queries; background here: http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/ ) also now implements PostFilter.

Examples:
{code}
// normal function range query used as a filter; all matching documents generated up front and cached
fq={!frange l=10 u=100}mul(popularity,price)

// function range query run in parallel with the main query, like a traditional Lucene filter
fq={!frange l=10 u=100 cache=false}mul(popularity,price)

// function range query checked after each document that already matches the query and all other filters.
// Good for really expensive function queries.
fq={!frange l=10 u=100 cache=false cost=100}mul(popularity,price)
{code}

ability to not cache a filter
-----------------------------
                Key: SOLR-2429
                URL: https://issues.apache.org/jira/browse/SOLR-2429
            Project: Solr
         Issue Type: New Feature
           Reporter: Yonik Seeley
        Attachments: SOLR-2429.patch

A user should be able to add {!cache=false} to a query or filter query.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
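The local-params strings in the examples above follow a simple pattern, so they can be built programmatically. A minimal Python sketch (the helper name frange_filter is illustrative, not part of any Solr client API):

```python
def frange_filter(func, l, u, cache=True, cost=None):
    """Build a Solr fq value for an frange query with optional
    cache=false and cost local params, per the examples above."""
    parts = [f"l={l}", f"u={u}"]
    if not cache:
        parts.append("cache=false")  # skip the filterCache; run alongside the main query
    if cost is not None:
        parts.append(f"cost={cost}")  # order non-cached filters cheapest-first
    return "{!frange " + " ".join(parts) + "}" + func
```

For example, frange_filter("mul(popularity,price)", 10, 100, cache=False, cost=100) reproduces the post-filter example from the patch description.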
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------
    Attachment: LUCENE-3234.patch

Added solr parameter hl.phraseLimit (default=5000)

Provide limit on phrase analysis in FastVectorHighlighter
---------------------------------------------------------
                Key: LUCENE-3234
                URL: https://issues.apache.org/jira/browse/LUCENE-3234
            Project: Lucene - Java
         Issue Type: Improvement
   Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
           Reporter: Mike Sokolov
           Assignee: Koji Sekiguchi
            Fix For: 3.4, 4.0
        Attachments: LUCENE-3234.patch, LUCENE-3234.patch

With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze.

The patch includes an artificial test case that shows a 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us.

With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases.
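The phraseLimit cutoff described above amounts to truncating the set of candidate phrases that the highlighter examines, with -1 meaning unlimited. A minimal Python sketch of that idea (collect_phrases is a hypothetical name for illustration, not FVH's actual API):

```python
def collect_phrases(candidate_phrases, phrase_limit=5000):
    """Sketch of the phraseLimit idea: stop enumerating candidate
    phrases once the limit is reached; phrase_limit = -1 keeps the
    existing examine-everything behavior."""
    if phrase_limit < 0:
        # existing FVH behavior: examine every possible phrase
        return list(candidate_phrases)
    out = []
    for phrase in candidate_phrases:
        if len(out) >= phrase_limit:
            break  # accept possibly less-than-perfect scoring for speed
        out.append(phrase)
    return out
```

With phrase_limit=1 this keeps only the first candidate, which mirrors the "just show the first snippet" use case from the description.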
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------
    Attachment: LUCENE-3234.patch
[jira] [Updated] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated LUCENE-3234:
---------------------------------
    Attachment: (was: LUCENE-3234.patch)
[jira] [Issue Comment Edited] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054213#comment-13054213 ]

Mike Sokolov edited comment on LUCENE-3234 at 6/24/11 2:06 AM:
---------------------------------------------------------------
Added solr parameter hl.phraseLimit (default=5000)

Koji - I'm not sure what the issue with assertTrue is? It looked to me as if the test case ultimately inherits from org.junit.Assert, which defines the method. Is there a different version of junit on Jenkins without that method?

was (Author: sokolov):
Added solr parameter hl.phraseLimit (default=5000)
[jira] [Commented] (LUCENE-3230) Make FSDirectory.fsync() public and static
[ https://issues.apache.org/jira/browse/LUCENE-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054234#comment-13054234 ]

Shai Erera commented on LUCENE-3230:
------------------------------------
I opened 2 RandomAccessFile instances over the same file and called getFD() on each. I received 2 FileDescriptor instances, each with a different 'handle' value. So I agree it's not clear whether syncing one FD affects the other.

I also found some posts on the web claiming that fsync() doesn't really work on all OSes, and that some hardware manufacturers enable hardware write caching, so even if the OS obeys the sync() call, the hardware may not. I guess there's not much we can do about the hardware case.

* http://hardware.slashdot.org/story/05/05/13/0529252/Your-Hard-Drive-Lies-to-You
* http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html (bad fsync on Mac OS X)

So perhaps we shouldn't make it a public API after all. If someone can sync() on the same OutputStream he used to flush/close, it's better than how we do it today. I hate to introduce public API with big warnings.

As for our case (Lucene's usage of sync()), it would be good if we could sync() in the IndexOutput that wrote the data. So maybe we should add sync() to IndexOutput? Not sure how that will play out in Directory, which today syncs file names, not IndexOutput instances.

Shall I close this issue then? Or rename it to handle the IndexOutput.sync() issue?

Make FSDirectory.fsync() public and static
------------------------------------------
                Key: LUCENE-3230
                URL: https://issues.apache.org/jira/browse/LUCENE-3230
            Project: Lucene - Java
         Issue Type: New Feature
         Components: core/store
           Reporter: Shai Erera
           Assignee: Shai Erera
           Priority: Minor
            Fix For: 3.3, 4.0

I find FSDirectory.fsync() (today protected and an instance method) very useful as a utility to sync() files. I'd like to create an FSDirectory.sync() utility which contains the exact same impl as FSDir.fsync(), and have the latter call it. We could make it part of IOUtils too, as it's a completely standalone utility.

I would get rid of FSDir.fsync() if it weren't protected (as if encouraging people to override it). I doubt anyone really overrides it (our core Directories don't).

Also, while reviewing the code, I noticed that if an IOException occurs, the code sleeps for 5 msec. If an InterruptedException occurs during that sleep, it immediately throws ThreadInterruptedException, completely ignoring the fact that it slept due to the IOException. Shouldn't we at least pass the IOException's getMessage() on to the ThreadInterruptedException?

The patch is trivial, so I'd like to get some feedback before I post it.