[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552497#comment-13552497 ] Dawid Weiss commented on LUCENE-4682: - Yeah, there are many ideas layered on top of each other and it's gotten beyond the point of being easy to comprehend. As for the NEXT bit -- in any implementation I've seen this leads to significant reduction in automaton size. But I'm not saying it's the optimal way to do it, perhaps there are other encoding options that would reach similar compression levels without the added complexity. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that the number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (pack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
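[Editor's note] To make the "don't expand when it wastes too much" idea above concrete, here is a rough Java sketch. None of these names come from the actual FST code; the parameters and the threshold are purely illustrative.

    // Hypothetical sketch of the "skip the array expansion when padding is too wasteful" heuristic.
    class ArcArrayHeuristic {
      /**
       * @param numArcs        number of outgoing arcs of the node
       * @param maxBytesPerArc widest single-arc encoding for this node
       * @param totalArcBytes  sum of the variable-width encodings of all arcs
       * @param maxWasteRatio  e.g. 0.25 to tolerate up to 25% padding
       */
      static boolean shouldExpandToArray(int numArcs, int maxBytesPerArc,
                                         int totalArcBytes, double maxWasteRatio) {
        int arrayBytes = numArcs * maxBytesPerArc;    // every slot padded to the widest arc
        int wastedBytes = arrayBytes - totalArcBytes; // padding forced by the outlier arc(s)
        return wastedBytes <= (int) (maxWasteRatio * arrayBytes);
      }
    }

The same wasted-bytes count could also be accumulated during the first pass and used to pick the threshold for the second (packing) pass, as the issue suggests.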
[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552498#comment-13552498 ] Dawid Weiss commented on LUCENE-3298: - The impact will show on 32-bit systems, I'm pretty sure of that. We don't care about hardware archaeology, do we? :) +1. FST has hard limit max size of 2.1 GB - Key: LUCENE-3298 URL: https://issues.apache.org/jira/browse/LUCENE-3298 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt. We could switch this to a paged byte[] and make the FST far larger. But I think this is low priority... I'm not going to work on it any time soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
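[Editor's note] The paged byte[] idea mentioned in the issue boils down to addressing bytes by a long and splitting that position into a block index plus an offset. A minimal sketch, with block size and all names made up for illustration (this is not Lucene's actual implementation):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative paged byte store addressed by long, so total size is not capped at Integer.MAX_VALUE.
    class PagedByteStore {
      private static final int BLOCK_BITS = 15;             // 32 KB blocks, for illustration
      private static final int BLOCK_SIZE = 1 << BLOCK_BITS;
      private static final int BLOCK_MASK = BLOCK_SIZE - 1;
      private final List<byte[]> blocks = new ArrayList<byte[]>();

      void writeByte(long pos, byte b) {
        int block = (int) (pos >>> BLOCK_BITS);
        while (blocks.size() <= block) {
          blocks.add(new byte[BLOCK_SIZE]);                  // grow lazily, one page at a time
        }
        blocks.get(block)[(int) (pos & BLOCK_MASK)] = b;
      }

      byte readByte(long pos) {
        return blocks.get((int) (pos >>> BLOCK_BITS))[(int) (pos & BLOCK_MASK)];
      }
    }

Internal references would also have to become vLong instead of vInt for this to lift the 2.1 GB limit end to end.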
[jira] [Updated] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-4683: --- Attachment: LUCENE-4683.patch * Added setNextReader to CategoryListIterator (instead of init()) and Aggregator. * Modified StandardFacetsAccumulator to iterate over the segments' atomic readers and call setNextReader accordingly. * Fixed an issue in ScoredDocIdsUtils where it assumed ScoredDocIDs are OpenBitSet, whereas for a long time they have been FixedBitSet. This caused an unnecessary copy from FixedBitSet to OpenBitSet. * Most of the other changes are API changes, i.e. createCategoryListIterator no longer takes an IndexReader etc. I haven't added a CHANGES line yet because I'm not sure whether this will make it into 4.1. Basically it's ready to go in (all tests pass), so I'll check later today what the status of the 4.1 branch is and decide accordingly. This now makes the cutover to DocValues even easier. That's what I'd like to do next. Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
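[Editor's note] A rough sketch of the per-segment flow described above. CategoryListIterator and Aggregator are the facet module's types and setNextReader(AtomicReaderContext) is the method added by this patch; the wrapping class, method name, and the empty doc loop body are assumptions for illustration only.

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.IndexReader;

    class PerSegmentAggregationSketch {
      void aggregate(IndexReader reader, CategoryListIterator cli, Aggregator aggregator)
          throws IOException {
        for (AtomicReaderContext context : reader.leaves()) {
          cli.setNextReader(context);          // position the iterator on this segment
          aggregator.setNextReader(context);   // and the aggregator too
          int maxDoc = context.reader().maxDoc();
          for (int doc = 0; doc < maxDoc; doc++) {
            // doc is now a per-segment doc ID; read its category ordinals and aggregate them here
          }
        }
      }
    }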
[jira] [Created] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
Shahar Davidson created SOLR-4302: - Summary: Improve CoreAdmin STATUS request response time by allowing user to omit the Index info Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Priority: Minor In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of the time is spent retrieving the index-related info. The suggested improvement allows the user to specify a parameter (indexInfo) which, if set to 'false', causes the index-related info (such as segmentCount, sizeInBytes, numDocs, etc.) not to be retrieved. By default, indexInfo is 'true' (to maintain the existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
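[Editor's note] As a usage illustration (host, port and the default admin path are assumptions, not part of the patch), such a request would look like http://localhost:8983/solr/admin/cores?action=STATUS&indexInfo=false and would return the same per-core status entries as a normal STATUS call, minus the index block.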
[jira] [Updated] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shahar Davidson updated SOLR-4302: -- Attachment: SOLR-4302.patch SOLR-4302, apply over trunk 1404975 Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552544#comment-13552544 ] Shahar Davidson edited comment on SOLR-4302 at 1/14/13 10:00 AM: - Attached suggested patch SOLR-4302.patch. apply over trunk 1404975 was (Author: shahar.davidson): SOLR-4302, apply over trunk 1404975 Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: looking for package org.apache.lucene.analysis.standard
Thanks to everyone, I feel I'm getting somewhere, but not quite there yet. I currently have the below in my pom. When I change my import to: import org.apache.lucene.queryparser.classic.QueryParser; Eclipse says it can't find org.apache.lucene.queryparser; however, the maven installer has no such issue. The maven installer does, however, have an issue with this line: Analyzer analyzer = new StandardAnalyzer(); It says: cannot find symbol symbol : constructor StandardAnalyzer() location: class org.apache.lucene.analysis.standard.StandardAnalyzer Even though I have the import: import org.apache.lucene.analysis.standard.StandardAnalyzer; which Eclipse has no issue with. I've cleaned my project and restarted Eclipse with no improvement to the differences shown by Eclipse and Maven. Any help much appreciated! Pom dependencies:
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>4.0.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>4.0.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>4.0.0</version>
  <scope>provided</scope>
</dependency>
-- View this message in context: http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033104.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
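[Editor's note] The compile error is most likely not a Maven problem: in Lucene 4.0 StandardAnalyzer has no no-argument constructor, it takes a Version. A sketch that should compile against the 4.0.0 artifacts listed above (the field name and query text are made up):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.util.Version;

    public class AnalyzerExample {
      public static void main(String[] args) throws Exception {
        // In 4.0 the analyzer constructor takes the compatibility version explicitly.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // QueryParser lives in the lucene-queryparser artifact (classic package) in 4.0.
        QueryParser parser = new QueryParser(Version.LUCENE_40, "contents", analyzer);
        System.out.println(parser.parse("hello world"));
      }
    }

Eclipse showing different errors than the Maven build usually just means the Eclipse classpath is stale relative to the pom.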
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552573#comment-13552573 ] Michael McCandless commented on LUCENE-4620: This change seemed to lose a bit of performance: look at 1/11/2013 on http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html But, that tests just one dimension (Date), with only 3 ords per doc, so I had assumed that this just wasn't enough ints being decoded to see the gains from this bulk decoding. So, I modified luceneutil to have more facets per doc (avg ~25 ords per doc across 9 dimensions; 2.5M unique ords), and the results are still slower:
{noformat}
    Task    QPS base   StdDev    QPS comp   StdDev    Pct diff
HighTerm        3.62   (2.5%)        3.24   (1.0%)    -10.5% ( -13% -  -7%)
 MedTerm        7.34   (1.7%)        6.78   (0.9%)     -7.6% ( -10% -  -5%)
 LowTerm       14.92   (1.6%)       14.32   (1.2%)     -4.0% (  -6% -  -1%)
PKLookup      181.47   (4.7%)      183.04   (5.3%)      0.9% (  -8% -  11%)
{noformat}
This is baffling ... not sure what's up. I would expect some gains given that the micro-benchmark showed sizable decode improvements. It must somehow be that decode cost is a minor part of facet counting? (which is not a good sign! it should be a big part of it...) Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-4676: --- Assignee: Simon Willnauer IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552594#comment-13552594 ] Shai Erera commented on LUCENE-4620: I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I believe that this growing should stabilize after a few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily: it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is more tricky, but to be on the safe side it can grow by buf.length, as at the minimum each value occupies only one byte. Some other decoders are trickier, but they are not in effect in your test above. But I must admit that I thought it was a no-brainer that replacing an iterator API with a bulk one would improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. And even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns; at most we just shouldn't see big wins? Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
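[Editor's note] A sketch of the "grow once up front, drop the per-value capacity checks" idea described above. This is not the facet module's actual decoder; the byte layout shown (high bit means "another byte follows") and the class/method names are illustrative assumptions.

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;

    class BulkVIntDecodeSketch {
      static void decode(BytesRef buf, IntsRef values) {
        values.offset = 0;
        values.length = 0;
        // Worst case: one int per byte, so a single grow() covers the whole block and
        // the "is the array big enough?" check disappears from the decode loop.
        if (values.ints.length < buf.length) {
          values.grow(buf.length);
        }
        int upto = buf.offset;
        final int end = buf.offset + buf.length;
        while (upto < end) {
          int value = 0;
          int b;
          do {
            b = buf.bytes[upto++] & 0xFF;
            value = (value << 7) | (b & 0x7F);   // accumulate 7 payload bits per byte
          } while ((b & 0x80) != 0);             // high bit set: more bytes follow
          values.ints[values.length++] = value;
        }
      }
    }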
[jira] [Comment Edited] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552594#comment-13552594 ] Shai Erera edited comment on LUCENE-4620 at 1/14/13 11:51 AM: -- I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsRef}} (something it didn't need to do before). But I believe that this growing should stabilize after few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily, it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is more tricky, but to be on the safe side it can grow by buf.length, as at the minimum each value occupies only one byte. Some other decoders are trickier, but they are not in effect in your test above. But I must admit that I thought it's a no brainer that replacing an iterator API by a bulk is going to improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. And even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns? At the most we shouldn't see big wins? was (Author: shaie): I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I believe that this growing should stabilize after few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily, it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is more tricky, but to be on the safe side it can grow by buf.length, as at the minimum each value occupies only one byte. Some other decoders are trickier, but they are not in effect in your test above. But I must admit that I thought it's a no brainer that replacing an iterator API by a bulk is going to improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. And even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns? At the most we shouldn't see big wins? Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. 
At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552599#comment-13552599 ] Commit Tag Bot commented on LUCENE-4683: [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revisionrevision=1432890 LUCENE-4683: Change Aggregator and CategoryListIterator to be per-segment Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-4683. Resolution: Fixed Fix Version/s: 5.0 4.1 I ran tests few times and all was quiet. Committed to trunk and 4x (add CHANGES too). Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552601#comment-13552601 ] Commit Tag Bot commented on LUCENE-4683: [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revisionrevision=1432894 LUCENE-4683: Change Aggregator and CategoryListIterator to be per-segment Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4321) java.io.FilterReader considered harmful
[ https://issues.apache.org/jira/browse/LUCENE-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-4321: -- Attachment: NoRandomReadMockTokenizer.java I had to extend MockTokenizer, because I read the buffer completely to decide what to do with the input (whether or not to convert it to something else). When you use different reading methods randomly, my tests don't pass. If the same method (it may be any of them) were used for the complete input string, they would pass, but now the output string is messed up, because some parts of the input are converted and some are not. java.io.FilterReader considered harmful --- Key: LUCENE-4321 URL: https://issues.apache.org/jira/browse/LUCENE-4321 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0-BETA Reporter: Robert Muir Fix For: 4.0, 5.0 Attachments: LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, NoRandomReadMockTokenizer.java See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39 Reader.java is fine, it has lots of methods like read(), read(char[]), read(CharBuffer), skip(), but these all have default implementations delegating to read(char[], int, int). Unfortunately FilterReader delegates too many unnecessary things such as read() and skip() in a broken way. It should have just left these alone. This can cause traps for someone upgrading because they have to override multiple methods, when read(char[], int, int) should be enough, and all Reader methods will then work correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-4302: --- Assignee: Shalin Shekhar Mangar Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552612#comment-13552612 ] Shai Erera commented on LUCENE-4620: I made this change to VInt8IntDecoder instead of checking inside the loop:
{code}
int numValues = buf.length; // a value occupies at least 1 byte
if (values.ints.length < numValues) {
  values.grow(numValues);
}
{code}
Ran EncodingSpeed again and compared the results. On average (4 datasets), VInt8 achieves a 0.69% speedup, DGap(VInt) 7.85% and Sorting(Unique(DGap(VInt))) 10.16%. The last one is the default Encoder, though its decoder is only DGap(VInt), so I'm not sure why there is a difference between that run and the previous one with 7.85%. However, it does look like it speeds things up... Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4321) java.io.FilterReader considered harmful
[ https://issues.apache.org/jira/browse/LUCENE-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552613#comment-13552613 ] Robert Muir commented on LUCENE-4321: - Your charfilter is broken. java.io.FilterReader considered harmful --- Key: LUCENE-4321 URL: https://issues.apache.org/jira/browse/LUCENE-4321 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0-BETA Reporter: Robert Muir Fix For: 4.0, 5.0 Attachments: LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, NoRandomReadMockTokenizer.java See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39 Reader.java is fine, it has lots of methods like read(), read(char[]), read(CharBuffer), skip(), but these all have default implementations delegating to read(char[], int, int). Unfortunately FilterReader delegates too many unnecessary things such as read() and skip() in a broken way. It should have just left these alone. This can cause traps for someone upgrading because they have to override multiple methods, when read(char[], int, int) should be enough, and all Reader methods will then work correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552614#comment-13552614 ] Michael McCandless commented on LUCENE-4682: I tried removing NEXT opto in building the all-English-Wikipedia-terms FST and it was a big hit: * With NEXT: 59267794 bytes * Without NEXT: 82543993 bytes So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 726 ns per lookup to 636 ns. But: we could get this speedup today, if we just fixed setting of a NEXT arc's target to be lazy instead. Today it's very costly for non-array arcs because we scan to the end of all nodes to set the target, even if the caller isn't going to use it, which is really ridiculous. I also tested delta-coding the arc target instead of the abs vInt we have today ... it wasn't a real test, instead I just measured how many bytes the vInt delta would be vs how many bytes the vInt abs it today, and the results were disappointing: * Abs vInt (what we do today): 26681349 bytes * Delta vInt: 25479316 bytes Which is surprising ... I guess we don't see much locality for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
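[Editor's note] The delta-vs-absolute comparison above boils down to summing vInt widths for each arc target. A tiny Java sketch of that measurement (names are invented; this is not the FST writer itself, and 'address'/'target' are hypothetical positions in the FST byte store):

    class TargetCodingCost {
      static int vIntLength(long v) {
        int bytes = 1;
        while (v >= 0x80) {   // 7 payload bits per byte
          v >>>= 7;
          bytes++;
        }
        return bytes;
      }

      static int savingsFromDelta(long address, long target) {
        int absBytes = vIntLength(target);                       // what is written today
        int deltaBytes = vIntLength(Math.abs(address - target)); // delta from the write position
        return absBytes - deltaBytes;   // positive means delta-coding would save bytes for this arc
      }
    }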
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552619#comment-13552619 ] Robert Muir commented on LUCENE-4682: - {quote} So FST would be ~39% larger if we remove NEXT {quote} But according to your notes above, we have 28% waste for this (with a long output). Are we making the right tradeoff? {quote} Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? {quote} Or, not do it at all if we cant figure it out? The reversing holds back other improvements so benchmarking it by itself could be misleading. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552621#comment-13552621 ] Commit Tag Bot commented on SOLR-4302: -- [trunk commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revisionrevision=1432901 SOLR-4302: New parameter 'indexInfo' (defaults to true) in CoreAdmin STATUS command can be used to omit index specific information Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552624#comment-13552624 ] Dawid Weiss commented on LUCENE-4682: - bq. I also tested delta-coding the arc target instead of the abs vInt we have today ... I did such experiments when I was working on that paper. Remember -- you don't publish negative results, unfortunately. Obviously I didn't try everything but: 1) NEXT was important, 2) the structure of the FST doesn't yield to easy local deltas; it's not easily separable and pointers typically jump all over. bq. Which is surprising ... I guess we don't see much locality for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Not really that surprising -- you encode common suffixes after all so most of them will appear in a properly sized sample. This is actually why the strategy of moving nodes around works too -- you move those that are super frequent but they'll most likely be reordered at the top suffix frequencies of the automaton anyway. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-4302. - Resolution: Fixed Fix Version/s: 5.0 4.1 Committed to trunk and branch_4x. Thanks Shahar! Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Fix For: 4.1, 5.0 Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552628#comment-13552628 ] Commit Tag Bot commented on SOLR-4302: -- [branch_4x commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revisionrevision=1432903 SOLR-4302: New parameter 'indexInfo' (defaults to true) in CoreAdmin STATUS command can be used to omit index specific information Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Fix For: 4.1, 5.0 Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552633#comment-13552633 ] Michael McCandless commented on LUCENE-4682: {quote} bq. So FST would be ~39% larger if we remove NEXT But according to your notes above, we have 28% waste for this (with a long output). Are we making the right tradeoff? {quote} Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today. {quote} bq. Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? Or, not do it at all if we cant figure it out? The reversing holds back other improvements so benchmarking it by itself could be misleading. {quote} I don't think we should drop NEXT unless we have some alternative? A 39% increase in size is non-trivial! I know reversing held back delta-coding of the node target, but that looks like it won't gain much. What else is it holding back? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (pack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552635#comment-13552635 ] Uwe Schindler commented on LUCENE-4570: --- I started a google code project: http://code.google.com/p/forbidden-apis/ This is a fork with many new additions: - auto-generated deprecated signature list (from rt.jar) - support for bundled and project-maintained signature lists (like the deprecated ones for various JDK versions, the well known charset/locale/... violators) - no direct ASM 4.1 dependency conflicting with other dependencies: The ASM library is jarjar'ed into the artifact - _not yet:_ Comments for every signature thats printed in error message - _not yet:_ Mäven support (Mojo) Once there is a release (hopefully soon) release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552635#comment-13552635 ] Uwe Schindler edited comment on LUCENE-4570 at 1/14/13 1:05 PM: I started a google code project: http://code.google.com/p/forbidden-apis/ This is a fork with many new additions: - auto-generated deprecated signature list (from rt.jar) - support for bundled and project-maintained signature lists (like the deprecated ones for various JDK versions, the well known charset/locale/... violators) - no direct ASM 4.1 dependency conflicting with other dependencies: The ASM library is jarjar'ed into the artifact - _not yet:_ Comments for every signature thats printed in error message - _not yet:_ Mäven support (Mojo) - Selckin already started a fork in Github, but as the new project is almost a complete rewrite of the API (decouple ANT task from logic), I will need his help - _not yet:_ Mäven Release, so IVY can download it Once there is a release (hopefully soon), this can ivy:cachepath'ed and taskdef'ed into the Lucene build was (Author: thetaphi): I started a google code project: http://code.google.com/p/forbidden-apis/ This is a fork with many new additions: - auto-generated deprecated signature list (from rt.jar) - support for bundled and project-maintained signature lists (like the deprecated ones for various JDK versions, the well known charset/locale/... violators) - no direct ASM 4.1 dependency conflicting with other dependencies: The ASM library is jarjar'ed into the artifact - _not yet:_ Comments for every signature thats printed in error message - _not yet:_ Mäven support (Mojo) Once there is a release (hopefully soon) release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552641#comment-13552641 ] Robert Muir commented on LUCENE-4682: - {quote} Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today. {quote} {quote} I know reversing held back delta-code of the node target, but, that looks like it won't gain much. What else is it holding back? {quote} I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers :) Its a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway... It would be really really really good for FSTs long term to do things like remove reversing, remove packed (fold these optos or at least most of them in by default), etc. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552644#comment-13552644 ] Dawid Weiss commented on LUCENE-4570: - Nice! release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: looking for package org.apache.lucene.analysis.standard
Hi Jim, Try getting rid of the <scope>provided</scope> lines. Steve On Jan 14, 2013 5:38 AM, JimAld jim.alder...@db.com wrote: Thanks to everyone, I feel I'm getting somewhere, but not quite there yet. I currently have the below in my pom. When I change my import to: import org.apache.lucene.queryparser.classic.QueryParser; Eclipse says it can't find org.apache.lucene.queryparser; however, the Maven installer has no such issue. The Maven installer does, however, have an issue with this line: Analyzer analyzer = new StandardAnalyzer(); It says: cannot find symbol symbol : constructor StandardAnalyzer() location: class org.apache.lucene.analysis.standard.StandardAnalyzer Even though I have the import: import org.apache.lucene.analysis.standard.StandardAnalyzer; which Eclipse has no issue with. I've cleaned my project and restarted Eclipse with no improvement to the differences shown by Eclipse and Maven. Any help much appreciated! Pom dependencies: <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>4.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-analyzers-common</artifactId> <version>4.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-queryparser</artifactId> <version>4.0.0</version> <scope>provided</scope> </dependency> -- View this message in context: http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033104.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552653#comment-13552653 ] Michael McCandless commented on LUCENE-4682: bq. I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers LOL :) But yeah I agree. bq. Its a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway... Right but it's not good if bus factor is 1 ... it's effectively dead code when that happens. bq. It would be really really really good for FSTs long term to do things like remove reversing, remove packed (fold these optos or at least most of them in by default), etc. +1, except that NEXT buys us a much smaller FST now. We can't just drop it ... we need some way to simplify it (eg somehow stop reversing). Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4620: --- Attachment: LUCENE-4620.patch Maybe doing bulk-vInt-decode (see patch) will be faster (just make hotspot's job easier) ... I'll test. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
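For readers following along, the bulk API shape proposed in the issue description would look roughly like the sketch below. The class names are placeholders, not the actual facet-module classes; only the encode(IntsRef, BytesRef) / decode(BytesRef, IntsRef) signatures come from the description above.
{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

abstract class BulkIntsEncoder {
  // Bulk encode: append the first values.length ints of 'values' onto 'buf'.
  abstract void encode(IntsRef values, BytesRef buf);
}

abstract class BulkIntsDecoder {
  // Bulk decode: read every int encoded in 'buf' into 'values', growing it as needed.
  abstract void decode(BytesRef buf, IntsRef values);
}
{code}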
[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552656#comment-13552656 ] Michael McCandless commented on LUCENE-3298: bq. The impact will show on 32-bit systems I'm pretty sure of that. Yeah I think it will too ... bq. We don't care about hardware archaeology, do we? I think Lucene should continue to run on 32 bit hardware, but I don't think performance on 32 bit is important, ie we should optimize for 64 bit performance. FST has hard limit max size of 2.1 GB - Key: LUCENE-3298 URL: https://issues.apache.org/jira/browse/LUCENE-3298 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt. We could switch this to a paged byte[] and make the far larger. But I think this is low priority... I'm not going to work on it any time soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-4570: - Assignee: Uwe Schindler release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Assignee: Uwe Schindler Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552668#comment-13552668 ] Shai Erera commented on LUCENE-4620: I see. I have two comments about the patch. This part is wrong: {code} +int needed = upto - buf.offset; +if (values.length < needed) { + values.grow(needed); +} {code} should be {code} +if (values.ints.length < buf.length) { + values.grow(buf.length); +} {code} Does it even run for you? because {{values.length = 0}} at start. Also, note how this way you check offset < upto on every byte read while in the current code it's checked only once per integer read. Maybe if you do a while loop inside the loop, something like {{while (b < 0)}}. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552668#comment-13552668 ] Shai Erera edited comment on LUCENE-4620 at 1/14/13 2:10 PM: - I see. I have two comments about the patch. This part is wrong: {code} +int needed = upto - buf.offset; +if (values.length < needed) { + values.grow(needed); +} {code} should be {code} +if (values.ints.length < buf.length) { + values.grow(buf.length); +} {code} With your patch, values.grow() is always called, even if inside it doesn't do anything. I wonder if we should not {{grow()}} the array, but rather grow it from the outside ourselves. Because IntsRef.grow() checks the capacity again (and Robert is against grow() anyway...). Also, note how this way you check offset < upto on every byte read while in the current code it's checked only once per integer read. Maybe if you do a while loop inside the loop, something like {{while (b < 0)}}. was (Author: shaie): I see. I have two comments about the patch. This part is wrong: {code} +int needed = upto - buf.offset; +if (values.length < needed) { + values.grow(needed); +} {code} should be {code} +if (values.ints.length < buf.length) { + values.grow(buf.length); +} {code} Does it even run for you? because {{values.length = 0}} at start. Also, note how this way you check offset < upto on every byte read while in the current code it's checked only once per integer read. Maybe if you do a while loop inside the loop, something like {{while (b < 0)}}. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
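Putting Shai's comments together, the decode loop under discussion would look roughly like the sketch below. This is a reconstruction for illustration, not the attached patch: capacity is grown once up front against buf.length (an upper bound on the number of encoded ints, since each vInt takes at least one byte), the continuation bit is consumed in an inner while (b < 0) loop, and the offset < upto bound is tested once per integer rather than once per byte.
{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

final class BulkVIntDecode {
  // Decodes the vInt-encoded values in buf.bytes[buf.offset .. buf.offset+buf.length)
  // into values.ints, setting values.length to the number of ints read.
  static void decode(BytesRef buf, IntsRef values) {
    if (values.ints.length < buf.length) {
      values.grow(buf.length);       // grow once: buf.length >= number of encoded ints
    }
    values.length = 0;
    int offset = buf.offset;
    final int upto = buf.offset + buf.length;
    while (offset < upto) {          // checked once per integer
      byte b = buf.bytes[offset++];
      int value = b & 0x7F;
      int shift = 7;
      while (b < 0) {                // high bit set: more bytes follow
        b = buf.bytes[offset++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      }
      values.ints[values.length++] = value;
    }
  }
}
{code}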
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552677#comment-13552677 ] Simon Willnauer commented on LUCENE-4676: - after looking at IW infostreams for a while I am convinced this is a test-bug (a pretty rare one I'd say). So what happens here is the following (applyDeletes=false): {noformat} 1. Thread[1] adds a doc (D1) 2. Thread[1] pull a new reader 3. Thread[1] adds another doc (D2) 3a. Thread[2] pull a new reader 3b. Thread[2] adds a del query 3c. Thread[2] pull a new reader 4. Thread[1] checks if reader is current {noformat} (3a - 3c are concurrent) given that we don't apply deletes on a NRT reader pull we should see _isCurrent == false_ Well this works most of the time unless there is a concurrent merge kicked off right after doc was added in _3_ that sees both flushed segments (D1 and D2) and subsequently tries to apply deletes to those segments. Here comes the problem, if the applyDeletes is fast enough (ie. reaches BufferedDeletesStream#prune()) before _4_ it drops the delete query from the streams (correct behavior!) but doesn't checkpoint since no segment was affected. If we check isCurrent now we see a _true_ value since the BufferedDeletesStream is empty (pruned) and the merge didn't finish yet (no checkpoint) which means the version of the SegmentInfos is the same. does this make sense? I switched over to NoMergePolicy on this test and tests pass all the time (500k times executed) while with a real MP it fails very quickly for me. IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552680#comment-13552680 ] Martijn van Groningen commented on LUCENE-3931: --- This makes sense to me. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( "l", "m", "t", "qu", "n", "s", "j", "d"), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen reassigned LUCENE-3931: - Assignee: Martijn van Groningen Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-4016: Attachment: SOLR-4016-disallow-partial-update.patch Patch which disallows partial updates on signature generating fields Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552686#comment-13552686 ] Tommaso Teofili commented on LUCENE-3931: - that's true for Italian as well. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552687#comment-13552687 ] Steve Rowe commented on LUCENE-3931: Because ElisionFilter use is used by more than just French, the set of contractions was moved out of ElisionFilter (LUCENE-3884). The issue of missing French contractions has already been addressed, in LUCENE-4662. I didn't notice this issue - I would have resolved it when I resolved LUCENE-4662. So Martijn, unless there is some other reason to keep this issue open, I think it can be resolved as a duplicate. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552692#comment-13552692 ] Steve Rowe commented on LUCENE-3931: bq. that's true for Italian as well. [ItalianAnalyzer|http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_0_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/it/ItalianAnalyzer.java?revision=1396952view=markup#l53] includes d in the list of contractions it gives to ElisionFilter. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
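Since LUCENE-3884 the article set is supplied to ElisionFilter by the analyzer, so a custom analysis chain that wants d handled can simply pass its own set, the way ItalianAnalyzer does. A minimal sketch, assuming the Lucene 4.x ElisionFilter(TokenStream, CharArraySet) constructor; the article list below is only an example, not an official default:
{code}
import java.util.Arrays;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.Version;

final class CustomElision {
  // The usual French articles plus "d"; ignoreCase = true.
  static final CharArraySet ARTICLES = new CharArraySet(Version.LUCENE_40,
      Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d"), true);

  static TokenStream wrap(TokenStream in) {
    return new ElisionFilter(in, ARTICLES);
  }
}
{code}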
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552729#comment-13552729 ] Tommaso Teofili commented on LUCENE-3931: - ok, thanks for clarifying Steve. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Closed] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen closed LUCENE-3931. - Resolution: Fixed Assignee: (was: Martijn van Groningen) Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552761#comment-13552761 ] Martijn van Groningen commented on LUCENE-3931: --- I see. I'll close it. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552764#comment-13552764 ] David Pilato commented on LUCENE-3931: -- Thanks all! Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4602) Use DocValues to store per-doc facet ord
[ https://issues.apache.org/jira/browse/LUCENE-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-4602: --- Component/s: modules/facet Use DocValues to store per-doc facet ord Key: LUCENE-4602 URL: https://issues.apache.org/jira/browse/LUCENE-4602 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Michael McCandless Attachments: LUCENE-4602.patch, LUCENE-4602.patch Spinoff from LUCENE-4600 DocValues can be used to hold the byte[] encoding all facet ords for the document, instead of payloads. I made a hacked up approximation of in-RAM DV (see CachedCountingFacetsCollector in the patch) and the gains were somewhat surprisingly large: {noformat} TaskQPS base StdDevQPS comp StdDev Pct diff HighTerm0.53 (0.9%)1.00 (2.5%) 87.3% ( 83% - 91%) LowTerm7.59 (0.6%) 26.75 (12.9%) 252.6% ( 237% - 267%) MedTerm3.35 (0.7%) 12.71 (9.0%) 279.8% ( 268% - 291%) {noformat} I didn't think payloads were THAT slow; I think it must be the advance implementation? We need to separately test on-disk DV to make sure it's at least on-par with payloads (but hopefully faster) and if so ... we should cutover facets to using DV. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
4.1 branch
For anyone with pending patches: I plan on branching for 4.1 at around 1:00pm US EST. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present
Jack, Did you test this to see if you could trigger this bug? But in any case, can you open a jira ticket so this won't fall under the radar? Even if the comment that was put here is true I guess we should minimally throw an exception, or use the first one and log a warning, maybe? James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, January 13, 2013 1:24 PM To: Lucene/Solr Dev Subject: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present Reading through the code for Solr SpellCheckComponent.java for 4.1, it looks like it neither complains nor defaults reasonably if more than one QueryConverter class is present in the Solr lib directories: Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>(); core.initPlugins(queryConverters,QueryConverter.class); //ensure that there is at least one query converter defined if (queryConverters.size() == 0) { LOG.info("No queryConverter defined, using default converter"); queryConverters.put("queryConverter", new SpellingQueryConverter()); } //there should only be one if (queryConverters.size() == 1) { queryConverter = queryConverters.values().iterator().next(); IndexSchema schema = core.getSchema(); String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType"); FieldType fieldType = schema.getFieldTypes().get(fieldTypeName); Analyzer analyzer = fieldType == null ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion) : fieldType.getQueryAnalyzer(); //TODO: There's got to be a better way! Where's Spring when you need it? queryConverter.setAnalyzer(analyzer); } No else! And queryConverter is not initialized, except for that code path where there was zero or one QueryConverter class. -- Jack Krupansky
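One possible shape for the guard James suggests (warn and fall back to the first converter rather than leaving queryConverter uninitialized). This is purely illustrative and only mirrors the variable names in the snippet above; it is not the committed fix:
{code}
if (queryConverters.size() == 0) {
  LOG.info("No queryConverter defined, using default converter");
  queryConverters.put("queryConverter", new SpellingQueryConverter());
} else if (queryConverters.size() > 1) {
  // Hypothetical guard: complain loudly instead of silently skipping initialization.
  LOG.warn("Expected exactly one QueryConverter, found " + queryConverters.size()
      + "; using an arbitrary one");
}
queryConverter = queryConverters.values().iterator().next();
// ... analyzer lookup and queryConverter.setAnalyzer(analyzer) as in the existing code ...
{code}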
RE: looking for package org.apache.lucene.analysis.standard
Hi, Well I've managed to fix the issue, sort of, so thought I should summarise here for any others who stumble across this issue. The reason maven was not able to build the project is that an empty constructor for StandardAnalyzer does not exist, even though both Eclipse and jadclipse showed that it did exist when referencing the 4.0.0 library. Using the constructor that took the Version as a param fixed this issue and allowed Maven to build the project. The Eclipse package explorer now shows no errors, however the Eclipse code viewer is littered with them. I tried referencing the 3 Lucene jars (shown above) directly in the classpath and this fixed the errors Eclipse showed with regards to Lucene, however it introduced a load of new errors with the rest of the project - can't win! Anyhow, I can live without intelli-sense for this project, as long as maven builds, that's the main thing. Thanks to everyone for their replies to this post. -- View this message in context: http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033195.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
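For anyone landing here with the same compile error, the fix Jim describes boils down to this: StandardAnalyzer in Lucene 4.0.0 has no zero-argument constructor, so the match version must be passed explicitly. A minimal sketch:
{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

final class AnalyzerHolder {
  // Lucene 4.0.0: StandardAnalyzer() does not exist; pass the match version instead.
  static Analyzer newAnalyzer() {
    return new StandardAnalyzer(Version.LUCENE_40);
  }
}
{code}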
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552836#comment-13552836 ] Simon Willnauer commented on LUCENE-4676: - to visualize this again here is a commented Log from a failure: {panel} IW [Thread-623]: getReader took 2 msec CMS [Lucene Merge Thread #0]: merge thread: start TEST [Thread-623]: refresh after delete {color:red}== HERE WE REFRESH AFTER THE DEL BY QUERY{color} DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false IW [Thread-623]: nrtIsCurrent: infoVersion matches: true DW changes: true BD changes: true DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false IW [Thread-623]: nrtIsCurrent: infoVersion matches: true DW changes: true BD changes: true DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false IW [Thread-623]: flush at getReader DW [Thread-623]: Thread-623 startFullFlush DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DWFC [Thread-623]: addFlushableState DocumentsWriterPerThread [pendingDeletes=gen=0, segment=null, aborting=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 3 ]] DW [Thread-623]: Thread-623: flush naked frozen global deletes {color:red}== HERE WE PUSH THE DEL BY QUERY TO THE BUFFERED DELETE STREAM{color} BD [Thread-623]: push deletes 1 deleted queries bytesUsed=32 delGen=4 packetCount=2 totBytesUsed=1056 DW [Thread-623]: flush: push buffered deletes: 1 deleted queries bytesUsed=32 BD [Lucene Merge Thread #0]: applyDeletes: infos=[_1(5.0):c1, _0(5.0):C1] packetCount=2 BD [Lucene Merge Thread #0]: seg=_1(5.0):c1 segGen=3 coalesced deletes=[CoalescedDeletes(termSets=1,queries=1)] newDelCount=0 BD [Lucene Merge Thread #0]: seg=_0(5.0):C1 segGen=1 coalesced deletes=[CoalescedDeletes(termSets=2,queries=1)] newDelCount=0 BD [Lucene Merge Thread #0]: applyDeletes took 0 msec {color:red}== THE MERGE KICKS IN{color} BD [Lucene Merge Thread #0]: prune sis=org.apache.lucene.index.SegmentInfos@6dfb8d2e minGen=5 packetCount=2 BD [Lucene Merge Thread #0]: pruneDeletes: prune 2 packets; 0 packets remain {color:red}== MERGE PRUNES AWAY THE PACKAGE{color} IW [Lucene Merge Thread #0]: merge seg=_2 _1(5.0):c1 _0(5.0):C1 IW [Lucene Merge Thread #0]: now merge merge=_1(5.0):c1 _0(5.0):C1 index=_0(5.0):C1 _1(5.0):c1 IW [Lucene Merge Thread #0]: merging _1(5.0):c1 _0(5.0):C1 IW [Thread-623]: don't apply deletes now delTermCount=0 bytesUsed=0 IW [Thread-623]: return reader version=6 reader=StandardDirectoryReader(:nrt _0(5.0):C1 _1(5.0):c1) DW [Thread-623]: Thread-623 finishFullFlush success=true IW [Thread-623]: getReader took 1 msec {color:red}== HERE WE ARE DONE REFRESHING AFTER THE DELETE -- DEL QUERY IS ALREADY GONE {color} IW [Lucene Merge Thread #0]: seg=_1(5.0):c1 no deletes IW [Lucene Merge Thread #0]: seg=_0(5.0):C1 no deletes TEST 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: done updating DW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false IW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: nrtIsCurrent: infoVersion matches: true DW changes: false BD changes: false {color:red}== HERE WE ARE ASSERTING ON isCurrent == FALSE and FAIL!!{color} DW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false DW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false SM [Lucene Merge Thread #0]: merge store matchedCount=2 vs 2 {panel} IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1
[JENKINS] Lucene-Solr-trunk-Linux (64bit/jdk1.7.0_10) - Build # 3772 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/3772/ Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC All tests passed Build Log: [...truncated 25890 lines...] -documentation-lint: [echo] checking for broken html... [jtidy] Checking for broken html (such as invalid tags)... [delete] Deleting directory /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/build/jtidy_tmp [echo] Checking for broken links... [exec] [exec] Crawl/parse... [exec] [exec] Verify... [exec] [exec] file:///build/docs/core/org/apache/lucene/analysis/package-summary.html [exec] BAD EXTERNAL LINK: http://lucene.apache.org/core/discussion.html [exec] [exec] Broken javadocs links were found! BUILD FAILED /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:60: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/build.xml:242: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/common-build.xml:1961: exec returned: 1 Total time: 37 minutes 5 seconds Build step 'Invoke Ant' marked build as failure Archiving artifacts Recording test results Description set: Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-4676: Attachment: LUCENE-4676.patch here is a patch to fix this test IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Attachments: LUCENE-4676.patch Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
Mark Miller created SOLR-4303: - Summary: On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552888#comment-13552888 ] Commit Tag Bot commented on SOLR-4303: -- [trunk commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revisionrevision=1432993 SOLR-4303: On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-4303: -- Description: Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html was:Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552897#comment-13552897 ] Commit Tag Bot commented on SOLR-4303: -- [branch_4x commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revisionrevision=1432995 SOLR-4303: On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved SOLR-4303. --- Resolution: Fixed On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-4.x-Windows (64bit/jdk1.7.0_10) - Build # 2397 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Windows/2397/ Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC All tests passed Build Log: [...truncated 25765 lines...] -documentation-lint: [echo] checking for broken html... [jtidy] Checking for broken html (such as invalid tags)... [delete] Deleting directory C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\build\jtidy_tmp [echo] Checking for broken links... [exec] [exec] Crawl/parse... [exec] [exec] Verify... [exec] [exec] file:///build/docs/core/org/apache/lucene/analysis/package-summary.html [exec] BAD EXTERNAL LINK: http://lucene.apache.org/core/discussion.html [exec] [exec] Broken javadocs links were found! BUILD FAILED C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\build.xml:60: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\build.xml:242: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\common-build.xml:1960: exec returned: 1 Total time: 64 minutes 10 seconds Build step 'Invoke Ant' marked build as failure Archiving artifacts Recording test results Description set: Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present
I just tried, and it causes an NPE, kind of as I had expected. I’ll file the Jira. -- Jack Krupansky From: Dyer, James Sent: Monday, January 14, 2013 10:50 AM To: dev@lucene.apache.org Subject: RE: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present Jack, Did you test this to see if you could trigger this bug? But in any case, can you open a jira ticket so this won't fall under the radar? Even if the comment that was put here is true I guess we should minimally throw an exception, or use the first one and log a warning, maybe? James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, January 13, 2013 1:24 PM To: Lucene/Solr Dev Subject: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present Reading through the code for Solr SpellCheckComponent.java for 4.1, it looks like it neither complains nor defaults reasonably if more than one QueryConverter class is present in the Solr lib directories: Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>(); core.initPlugins(queryConverters,QueryConverter.class); //ensure that there is at least one query converter defined if (queryConverters.size() == 0) { LOG.info("No queryConverter defined, using default converter"); queryConverters.put("queryConverter", new SpellingQueryConverter()); } //there should only be one if (queryConverters.size() == 1) { queryConverter = queryConverters.values().iterator().next(); IndexSchema schema = core.getSchema(); String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType"); FieldType fieldType = schema.getFieldTypes().get(fieldTypeName); Analyzer analyzer = fieldType == null ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion) : fieldType.getQueryAnalyzer(); //TODO: There's got to be a better way! Where's Spring when you need it? queryConverter.setAnalyzer(analyzer); } No else! And queryConverter is not initialized, except for that code path where there was zero or one QueryConverter class. -- Jack Krupansky
[jira] [Created] (LUCENE-4684) Allow DirectSpellChecker to be extended
Martijn van Groningen created LUCENE-4684: - Summary: Allow DirectSpellChecker to be extended Key: LUCENE-4684 URL: https://issues.apache.org/jira/browse/LUCENE-4684 Project: Lucene - Core Issue Type: Improvement Components: modules/spellchecker Environment: Currently the suggestSimilar() that actually operates on the FuzzyTermy is private protected. Would be great if that would just be protected for extensions. Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Minor -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1433005 - in /lucene/dev/branches/branch_4x: ./ dev-tools/ dev-tools/scripts/checkJavadocLinks.py
Thanks Robert. - Steve

On Jan 14, 2013, at 12:41 PM, rm...@apache.org wrote:

Author: rmuir
Date: Mon Jan 14 17:41:01 2013
New Revision: 1433005

URL: http://svn.apache.org/viewvc?rev=1433005&view=rev
Log: whitelist this link

Modified:
    lucene/dev/branches/branch_4x/   (props changed)
    lucene/dev/branches/branch_4x/dev-tools/   (props changed)
    lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py

Modified: lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py?rev=1433005&r1=1433004&r2=1433005&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py (original)
+++ lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py Mon Jan 14 17:41:01 2013
@@ -197,6 +197,9 @@ def checkAll(dirName):
    elif link.find('lucene.apache.org/java/docs/discussion.html') != -1:
      # OK
      pass
+    elif link.find('lucene.apache.org/core/discussion.html') != -1:
+      # OK
+      pass
    elif link.find('lucene.apache.org/solr/mirrors-solr-latest-redir.html') != -1:
      # OK
      pass

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4684) Allow DirectSpellChecker to be extended
[ https://issues.apache.org/jira/browse/LUCENE-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-4684: -- Attachment: LUCENE-4684.patch Made all the fields of DirectSpellChecker protected, the suggestSimilar method and the ScoreTerm inner class. Allow DirectSpellChecker to be extended Key: LUCENE-4684 URL: https://issues.apache.org/jira/browse/LUCENE-4684 Project: Lucene - Core Issue Type: Improvement Components: modules/spellchecker Environment: Currently the suggestSimilar() that actually operates on the FuzzyTermy is private protected. Would be great if that would just be protected for extensions. Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Minor Attachments: LUCENE-4684.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2765) Optimize scanning in DocsEnum
[ https://issues.apache.org/jira/browse/LUCENE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2765: --- Fix Version/s: (was: 4.1) 4.2 Optimize scanning in DocsEnum - Key: LUCENE-2765 URL: https://issues.apache.org/jira/browse/LUCENE-2765 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: LUCENE-2765.patch, LUCENE-2765.patch Similar to LUCENE-2761: when we call advance(), after skipping it scans, but this can be optimized better than calling nextDoc() like today {noformat} // scan for the rest: do { nextDoc(); } while (target > doc); {noformat} in particular, the freq can be skipVinted and the skipDocs (deletedDocs) don't need to be checked during this scanning. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
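To make the proposed optimization concrete, here is a toy sketch of the scan loop. It is not the actual codec code: it assumes a simple interleaved docDelta/freq vInt layout, and the IndexInput / doc-count parameters are hypothetical. The point is that during the post-skip scan the doc deltas still have to be decoded, but each freq vInt can be skipped byte-by-byte without decoding its value, and deleted docs need not be checked because every doc below the target is discarded anyway.

{code}
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

final class ScanSketch {
  /** Scan forward through interleaved (docDelta, freq) vInts until doc >= target. */
  static int scanTo(IndexInput in, int currentDoc, int remainingDocs, int target) throws IOException {
    int doc = currentDoc;
    while (doc < target && remainingDocs > 0) {
      doc += in.readVInt();   // the doc delta must be decoded to advance
      skipVInt(in);           // the freq: consume its bytes without decoding
      remainingDocs--;
      // note: no deleted-docs check here -- docs below target are thrown away regardless
    }
    return doc >= target ? doc : Integer.MAX_VALUE;  // MAX_VALUE = list exhausted
  }

  private static void skipVInt(IndexInput in) throws IOException {
    while ((in.readByte() & 0x80) != 0) {
      // continuation bit set: keep consuming bytes
    }
  }
}
{code}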
[jira] [Updated] (LUCENE-2832) on Windows 64-bit, maybe we should default to a better maxBBufSize in MMapDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2832: --- Fix Version/s: (was: 4.1) 4.2 on Windows 64-bit, maybe we should default to a better maxBBufSize in MMapDirectory --- Key: LUCENE-2832 URL: https://issues.apache.org/jira/browse/LUCENE-2832 Project: Lucene - Core Issue Type: Improvement Components: core/store Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: LUCENE-2832.patch Currently the default max buffer size for MMapDirectory is 256MB on 32bit and Integer.MAX_VALUE on 64bit: {noformat} public static final int DEFAULT_MAX_BUFF = Constants.JRE_IS_64BIT ? Integer.MAX_VALUE : (256 * 1024 * 1024); {noformat} But, in windows on 64-bit, you are practically limited to 8TB. This can cause problems in extreme cases, such as: http://www.lucidimagination.com/search/document/7522ee54c46f9ca4/map_failed_at_getsearcher Perhaps it would be good to change this default such that its 256MB on 32Bit *OR* windows, but leave it at Integer.MAX_VALUE on other 64-bit and 64-bit (48-bit) systems. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
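As a usage note on the default discussed above: the chunk size is already controllable per directory, so a deployment that runs into address-space pressure can cap it explicitly instead of relying on the platform default. A minimal sketch against the 4.x constructor (the index path is just a placeholder; passing null uses the default LockFactory):

{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.MMapDirectory;

public class MMapChunkSizeExample {
  public static void main(String[] args) throws IOException {
    // Cap each mapped chunk at 256 MB rather than Integer.MAX_VALUE.
    MMapDirectory dir = new MMapDirectory(new File("/path/to/index"), null, 256 * 1024 * 1024);
    System.out.println("max chunk size: " + dir.getMaxChunkSize());
    dir.close();
  }
}
{code}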
[jira] [Updated] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-4016: Attachment: SOLR-4016-disallow-partial-update.patch Patch with a better test. Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt:
{noformat}
<updateRequestProcessorChain name="text_hash">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">text_hash</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{noformat}
Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone:
{noformat}
$ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world}}}}' && curl '$URL/select?q=id:abcde'
{responseHeader:{status:0,QTime:30}}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">ad48c7ad60ac22cc</str><long name="_version_">1417247434224959488</long></doc></result>
</response>
$
$ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' && curl '$URL/select?q=id:abcde'
{responseHeader:{status:0,QTime:27}}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">b169c743d220da8d</str><long name="_version_">141724802221564</long></doc></result>
</response>
{noformat}
Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values:
{noformat}
$ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title}}}}' && curl '$URL/select?q=id:abcde'
{responseHeader:{status:0,QTime:39}}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash"></str><str name="title">new title</str><long name="_version_">1417248120480202752</long></doc></result>
</response>
{noformat}
-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2570) randomize indexwriter settings in solr tests
[ https://issues.apache.org/jira/browse/SOLR-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated SOLR-2570: - Fix Version/s: (was: 4.1) 4.2 randomize indexwriter settings in solr tests Key: SOLR-2570 URL: https://issues.apache.org/jira/browse/SOLR-2570 Project: Solr Issue Type: Test Components: Build Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: SOLR-2570.patch we should randomize indexwriter settings like lucene tests do, to vary # of segments and such. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1674) improve analysis tests, cut over to new API
[ https://issues.apache.org/jira/browse/SOLR-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated SOLR-1674: - Fix Version/s: (was: 4.1) 4.2 improve analysis tests, cut over to new API --- Key: SOLR-1674 URL: https://issues.apache.org/jira/browse/SOLR-1674 Project: Solr Issue Type: Test Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: SOLR-1674.patch, SOLR-1674.patch, SOLR-1674_speedup.patch This patch * converts all analysis tests to use the new tokenstream api * converts most tests to use the more stringent assertion mechanisms from lucene * adds new tests to improve coverage Most bugs found by more stringent testing have been fixed, with the exception of SynonymFilter. The problems with this filter are more serious, the previous tests were essentially a no-op. The new tests for SynonymFilter test the current behavior, but have FIXMEs with what I think the old test wanted to expect in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3459) Change ChainedFilter to use FixedBitSet
[ https://issues.apache.org/jira/browse/LUCENE-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3459: --- Fix Version/s: (was: 4.1) 4.2 Change ChainedFilter to use FixedBitSet --- Key: LUCENE-3459 URL: https://issues.apache.org/jira/browse/LUCENE-3459 Project: Lucene - Core Issue Type: Task Components: modules/other Affects Versions: 3.4, 4.0-ALPHA Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.2 ChainedFilter also uses OpenBitSet(DISI) at the moment. It should also be changed to use FixedBitSet. There are two issues: - It exposes sometimes OpenBitSetDISI to it's public API - we should remove those methods like in BooleanFilter and break backwards - It allows a XOR operation. This is not yet supported by FixedBitSet, but it's easy to add (like for BooleanFilter). On the other hand, this XOR operation is bogus, as it may mark documents in the BitSet that are deleted, breaking new features like applying Filters down-low (LUCENE-1536). We should remove the XOR operation maybe or force it to use IR.validDocs() (trunk) or IR.isDeleted() -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
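For reference, the XOR described above as easy to add would amount to roughly a word-wise pass over the two backing long[] arrays, in the same style as the existing and()/or()/andNot() operations. A minimal sketch (not the actual patch; it assumes both sets were sized for the same reader and silently ignores bits beyond the shorter backing array):

{code}
import org.apache.lucene.util.FixedBitSet;

final class XorSketch {
  /** In-place symmetric difference: flips in dest every bit that is set in other. */
  static void xor(FixedBitSet dest, FixedBitSet other) {
    long[] destBits = dest.getBits();
    long[] otherBits = other.getBits();
    int numWords = Math.min(destBits.length, otherBits.length);
    for (int i = 0; i < numWords; i++) {
      destBits[i] ^= otherBits[i];   // 64 documents per long word
    }
  }
}
{code}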
[jira] [Updated] (LUCENE-3034) If you vary a setting per round and that setting is a long string, the report padding/columns break down.
[ https://issues.apache.org/jira/browse/LUCENE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3034: --- Fix Version/s: (was: 4.1) 4.2 If you vary a setting per round and that setting is a long string, the report padding/columns break down. - Key: LUCENE-3034 URL: https://issues.apache.org/jira/browse/LUCENE-3034 Project: Lucene - Core Issue Type: Improvement Components: modules/benchmark Reporter: Mark Miller Assignee: Mark Miller Priority: Trivial Fix For: 4.2 This is especially noticeable if you vary a setting where the value is a fully specified class name - in this case, it would be nice if columns in each row still lined up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3451: --- Fix Version/s: (was: 4.1) 4.2 Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Core Issue Type: Improvement Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.2 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throw UOE. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3968) Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens
[ https://issues.apache.org/jira/browse/LUCENE-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3968: --- Fix Version/s: (was: 4.1) 4.2 Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens - Key: LUCENE-3968 URL: https://issues.apache.org/jira/browse/LUCENE-3968 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.2 Attachments: LUCENE-3968.patch MockGraphTokenFilter is rather hairy... I've managed to simplify it (I think!) by breaking apart its two functions... I think LookaheadTokenFilter can be used in the future for other graph aware filters. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552935#comment-13552935 ] Commit Tag Bot commented on SOLR-4016: -- [trunk commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revisionrevision=1433013 SOLR-4016: Deduplication does not work with atomic/partial updates so disallow atomic update requests which change signature generating fields. Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-4304) NPE in Solr SpellCheckComponent if more than one QueryConverter
Jack Krupansky created SOLR-4304: Summary: NPE in Solr SpellCheckComponent if more than one QueryConverter Key: SOLR-4304 URL: https://issues.apache.org/jira/browse/SOLR-4304 Project: Solr Issue Type: Bug Components: spellchecker Affects Versions: 4.0 Reporter: Jack Krupansky The Solr SpellCheckComponent uses only a single QueryConverter, but fails with an NPE if more than one QueryConverter class is registered in solrconfig.xml. Repro: 1. Add to 4.0 example solrconfig.xml: queryConverter name=myQueryConverter-1 class=solr.SpellingQueryConverter/ queryConverter name=myQueryConverter-2 class=solr.SuggestQueryConverter/ 2. Perform a spellcheck request: curl http://localhost:8983/solr/spell?q=testindent=true; 3. Examine the NPE: ?xml version=1.0 encoding=UTF-8? response lst name=responseHeader int name=status500/int int name=QTime4/int /lst result name=response numFound=0 start=0 /result lst name=error str name=tracejava.lang.NullPointerException at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:136) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) at org.eclipse.jetty.server.Server.handle(Server.java:351) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599) at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534) at java.lang.Thread.run(Unknown Source) /str int name=code500/int /lst /response Suggested resolution: Use the first QueryConverter, but give a warning that indicates the class name of the one being used. Alternatively, throw a nasty but informative exception indicating the true nature of the problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3424) Return sequence ids from IW update/delete/add/commit to allow total ordering outside of IW
[ https://issues.apache.org/jira/browse/LUCENE-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3424: --- Fix Version/s: (was: 4.1) 4.2 Return sequence ids from IW update/delete/add/commit to allow total ordering outside of IW -- Key: LUCENE-3424 URL: https://issues.apache.org/jira/browse/LUCENE-3424 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.2 Attachments: LUCENE-3424.patch Based on the discussion on the [mailing list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201109.mbox/%3CCAAHmpki-h7LUZGCUX_rfFx=q5-YkLJei+piRG=oic8d1pnr...@mail.gmail.com%3E] IW should return sequence ids from update/delete/add and commit to allow ordering of events for consistent transaction logs and recovery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552950#comment-13552950 ] Yonik Seeley commented on SOLR-4016: bq. I see why you suggested that. The signature is like a unique key and modifying it seems like a rare use-case. But, if we do go that way, we should throw an exception and explicitly disallow partial update of signature generating fields. There are different use-cases here. If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates). Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? 
response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3069: --- Fix Version/s: (was: 4.1) 4.2 Lucene should have an entirely memory resident term dictionary -- Key: LUCENE-3069 URL: https://issues.apache.org/jira/browse/LUCENE-3069 Project: Lucene - Core Issue Type: Improvement Components: core/index, core/search Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.2 FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
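To make the idea of encoding everything for a term into the FST concrete, here is a small self-contained sketch that builds an in-memory FST mapping whole terms to a long output (standing in for whatever per-term metadata a term dictionary would store) and then probes it. It is illustrative only; exact signatures such as PositiveIntOutputs.getSingleton and Util.toIntsRef have shifted between 4.x releases.

{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstTermDictSketch {
  public static void main(String[] args) throws Exception {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();

    // Terms must be added in sorted order; the long output stands in for
    // per-term metadata (file pointer, docFreq, ...).
    builder.add(Util.toIntsRef(new BytesRef("cat"), scratch), 7L);
    builder.add(Util.toIntsRef(new BytesRef("dog"), scratch), 12L);
    builder.add(Util.toIntsRef(new BytesRef("dogs"), scratch), 13L);
    FST<Long> fst = builder.finish();

    // Entirely memory-resident exact-term lookup: walk the FST arcs.
    Long value = Util.get(fst, new BytesRef("dog"));
    System.out.println("dog -> " + value);   // prints 12
  }
}
{code}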
[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
[ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3022: --- Fix Version/s: (was: 4.1) 4.2 DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect - Key: LUCENE-3022 URL: https://issues.apache.org/jira/browse/LUCENE-3022 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 2.9.4, 3.1 Reporter: Johann Höchtl Assignee: Robert Muir Priority: Minor Fix For: 4.2 Attachments: LUCENE-3022.patch, LUCENE-3022.patch Original Estimate: 5m Remaining Estimate: 5m When using the DictionaryCompoundWordTokenFilter with a german dictionary, I got a strange behaviour: The german word streifenbluse (blouse with stripes) was decompounded to streifen (stripe), reifen (tire), which makes no sense at all. I thought the flag onlyLongestMatch would fix this, because streifen is longer than reifen, but it had no effect. So I reviewed the sourcecode and found the problem:
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.length() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.buffer());
  for (int i = 0; i < token.length() - this.minSubwordSize; ++i) {
    Token longestMatchToken = null;
    for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
      if (i + j > token.length()) {
        break;
      }
      if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken != null) {
            if (longestMatchToken.length() < j) {
              longestMatchToken = createToken(i, j, token);
            }
          } else {
            longestMatchToken = createToken(i, j, token);
          }
        } else {
          tokens.add(createToken(i, j, token));
        }
      }
    }
    if (this.onlyLongestMatch && longestMatchToken != null) {
      tokens.add(longestMatchToken);
    }
  }
}
[/code]
should be changed to
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.termLength() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.termBuffer());
  Token longestMatchToken = null;
  for (int i = 0; i < token.termLength() - this.minSubwordSize; ++i) {
    for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
      if (i + j > token.termLength()) {
        break;
      }
      if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken != null) {
            if (longestMatchToken.termLength() < j) {
              longestMatchToken = createToken(i, j, token);
            }
          } else {
            longestMatchToken = createToken(i, j, token);
          }
        } else {
          tokens.add(createToken(i, j, token));
        }
      }
    }
  }
  if (this.onlyLongestMatch && longestMatchToken != null) {
    tokens.add(longestMatchToken);
  }
}
[/code]
So, that only the longest token is really indexed and the onlyLongestMatch Flag makes sense. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3797) 3xCodec should throw UOE if a DocValuesConsumer is pulled
[ https://issues.apache.org/jira/browse/LUCENE-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3797: --- Fix Version/s: (was: 4.1) 4.2 3xCodec should throw UOE if a DocValuesConsumer is pulled -- Key: LUCENE-3797 URL: https://issues.apache.org/jira/browse/LUCENE-3797 Project: Lucene - Core Issue Type: Improvement Components: core/codecs, core/index Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.2 Attachments: LUCENE-3797.patch, LUCENE-3797.patch currently we just return null if a DVConsumer is pulled from 3.x which is trappy since it causes an NPE in DocFieldProcessor. We should rather throw a UOE. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3252) Use single array in fixed straight bytes DocValues if possible
[ https://issues.apache.org/jira/browse/LUCENE-3252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3252: --- Fix Version/s: (was: 4.1) 4.2 Use single array in fixed straight bytes DocValues if possible -- Key: LUCENE-3252 URL: https://issues.apache.org/jira/browse/LUCENE-3252 Project: Lucene - Core Issue Type: Improvement Components: core/search, core/store Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.2 Attachments: LUCENE-3252.patch FixedStraightBytesImpl currently uses a straight array only if the byte size is 1 per document we could further optimize this to use a single array if all the values fit in. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1921) Absurdly large radius (miles) search fails to include entire earth
[ https://issues.apache.org/jira/browse/LUCENE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-1921: --- Fix Version/s: (was: 4.1) 4.2 Absurdly large radius (miles) search fails to include entire earth -- Key: LUCENE-1921 URL: https://issues.apache.org/jira/browse/LUCENE-1921 Project: Lucene - Core Issue Type: Bug Components: modules/spatial Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Chris Male Priority: Minor Fix For: 4.2 Attachments: ASF.LICENSE.NOT.GRANTED--TEST-1921.patch Spinoff from LUCENE-1781. If you do a very large (eg 10 miles) radius search then the lat/lng bound box wraps around the entire earth and all points should be accepted. But this fails today (many points are rejected). It's easy to see the issue: edit TestCartesian, and insert a very large miles into either testRange or testGeoHashRange. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1710) Add byte/short to NumericUtils, NumericField and NumericRangeQuery
[ https://issues.apache.org/jira/browse/LUCENE-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-1710: --- Fix Version/s: (was: 4.1) 4.2 Add byte/short to NumericUtils, NumericField and NumericRangeQuery -- Key: LUCENE-1710 URL: https://issues.apache.org/jira/browse/LUCENE-1710 Project: Lucene - Core Issue Type: New Feature Components: core/index, core/search Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.2 Although NumericRangeQuery will not profit much from trie-encoding short/byte fields (byte fields with e.g. precisionStep 8 would only create one precision), it may be good to have these two data types available with NumericField to be generally able to store them in prefix-encoded form in index. This is important for loading them into FieldCache where they require much less memory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2527) FieldCache.getTermsIndex should cache fasterButMoreRAM=true|false to the same cache key
[ https://issues.apache.org/jira/browse/LUCENE-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2527: --- Fix Version/s: (was: 4.1) 4.2 FieldCache.getTermsIndex should cache fasterButMoreRAM=true|false to the same cache key --- Key: LUCENE-2527 URL: https://issues.apache.org/jira/browse/LUCENE-2527 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.0-ALPHA Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.2 When we cutover FieldCache to use shared byte[] blocks, we added the boolean fasterButMoreRAM option, so you could tradeoff time/space. It defaults to true. The thinking is that an expert user, who wants to use false, could pre-populate FieldCache by loading the field with false, and then later when sorting on that field it'd use that same entry. But there's a bug -- when sorting, it then loads a 2nd entry with true. This is because the Entry.custom in FieldCache participates in equals/hashCode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1820) WildcardQueryNode to expose the positions of the wildcard characters, for easier use in processors and builders
[ https://issues.apache.org/jira/browse/LUCENE-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-1820: --- Fix Version/s: (was: 4.1) 4.2 WildcardQueryNode to expose the positions of the wildcard characters, for easier use in processors and builders --- Key: LUCENE-1820 URL: https://issues.apache.org/jira/browse/LUCENE-1820 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Luis Alves Assignee: Michael Busch Priority: Minor Fix For: 4.2 Change the WildcardQueryNode to expose the positions of the wildcard characters. This would allow the AllowLeadingWildcardProcessor not to have to knowledge about the wildcard chars * and ? and avoid double check again. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2735) First Cut at GroupVarInt with FixedIntBlockIndexInput / Output
[ https://issues.apache.org/jira/browse/LUCENE-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2735: --- Fix Version/s: (was: 4.1) 4.2 First Cut at GroupVarInt with FixedIntBlockIndexInput / Output -- Key: LUCENE-2735 URL: https://issues.apache.org/jira/browse/LUCENE-2735 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.2 Attachments: LUCENE-2735_alt.patch, LUCENE-2735.patch, LUCENE-2735.patch, LUCENE-2735.patch I have hacked together a FixedIntBlockIndex impl with Group VarInt encoding - this does way worse than standard codec in benchmarks but I guess that is mainly due to the FixedIntBlockIndex limitations. Once LUCENE-2723 is in / or builds with trunk again I will update and run some tests. The isolated microbenchmark shows that there could be improvements over vint even in java though and I am sure we can make it faster impl. wise. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
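For readers unfamiliar with the encoding being benchmarked above: group varint packs four ints behind a single tag byte whose 2-bit fields record how many bytes (1-4) each value occupies, so a decoder branches on one tag byte per group instead of one continuation bit per byte as with vInt. A minimal standalone sketch of the encoder side (illustrative only, not the attached patch):

{code}
final class GroupVarIntSketch {
  /**
   * Encode values[offset..offset+4) into out starting at outPos.
   * Layout: one tag byte (2 bits per value = numBytes - 1), then each value
   * little-endian using only the bytes it needs. Returns the new write position.
   */
  static int encodeGroup(int[] values, int offset, byte[] out, int outPos) {
    int tagPos = outPos++;               // reserve room for the tag byte
    int tag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[offset + i];
      int numBytes = 1;
      if ((v >>> 8) != 0)  numBytes = 2;
      if ((v >>> 16) != 0) numBytes = 3;
      if ((v >>> 24) != 0) numBytes = 4;
      tag |= (numBytes - 1) << (i * 2);  // record this value's width in the tag
      for (int b = 0; b < numBytes; b++) {
        out[outPos++] = (byte) (v >>> (8 * b));
      }
    }
    out[tagPos] = (byte) tag;
    return outPos;
  }
}
{code}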
[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552967#comment-13552967 ] Shalin Shekhar Mangar commented on SOLR-4016: - bq. If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates). The patch that I committed throws an exception if an atomic update request contains fields that are used to compute the signature. An atomic update request which does not modify the signature, proceeds as normal. This way we make sure that a document never contains a wrong signature. Do you agree that this is an acceptable compromise until a proper fix is in place? Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? 
response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: 4.1 branch
branches/lucene_solr_4_1/ is open for business! I'm going to change version strings in branch_4x from 4.1 to 4.2 now. Steve On Jan 14, 2013, at 10:13 AM, Steve Rowe sar...@gmail.com wrote: For anyone with pending patches: I plan on branching for 4.1 at around 1:00pm US EST. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552968#comment-13552968 ] Commit Tag Bot commented on LUCENE-3298: [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revisionrevision=1433026 LUCENE-3298: FSTs can now be larger than 2GB, have more than 2B nodes FST has hard limit max size of 2.1 GB - Key: LUCENE-3298 URL: https://issues.apache.org/jira/browse/LUCENE-3298 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt. We could switch this to a paged byte[] and make the far larger. But I think this is low priority... I'm not going to work on it any time soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4304) NPE in Solr SpellCheckComponent if more than one QueryConverter
[ https://issues.apache.org/jira/browse/SOLR-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552969#comment-13552969 ] Jack Krupansky commented on SOLR-4304: -- The issue is in this code:
{code}
Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>();
core.initPlugins(queryConverters, QueryConverter.class);
//ensure that there is at least one query converter defined
if (queryConverters.size() == 0) {
  LOG.info("No queryConverter defined, using default converter");
  queryConverters.put("queryConverter", new SpellingQueryConverter());
}
//there should only be one
if (queryConverters.size() == 1) {
  queryConverter = queryConverters.values().iterator().next();
  IndexSchema schema = core.getSchema();
  String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType");
  FieldType fieldType = schema.getFieldTypes().get(fieldTypeName);
  Analyzer analyzer = fieldType == null ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion)
                                        : fieldType.getQueryAnalyzer();
  //TODO: There's got to be a better way! Where's Spring when you need it?
  queryConverter.setAnalyzer(analyzer);
}
{code}
No else! And queryConverter is not initialized, except for that code path where there was zero or one QueryConverter class. NPE in Solr SpellCheckComponent if more than one QueryConverter --- Key: SOLR-4304 URL: https://issues.apache.org/jira/browse/SOLR-4304 Project: Solr Issue Type: Bug Components: spellchecker Affects Versions: 4.0 Reporter: Jack Krupansky The Solr SpellCheckComponent uses only a single QueryConverter, but fails with an NPE if more than one QueryConverter class is registered in solrconfig.xml. Repro: 1. Add to 4.0 example solrconfig.xml: <queryConverter name="myQueryConverter-1" class="solr.SpellingQueryConverter"/> <queryConverter name="myQueryConverter-2" class="solr.SuggestQueryConverter"/> 2. Perform a spellcheck request: curl "http://localhost:8983/solr/spell?q=test&indent=true" 3. Examine the NPE: <?xml version="1.0" encoding="UTF-8"?>
response lst name=responseHeader int name=status500/int int name=QTime4/int /lst result name=response numFound=0 start=0 /result lst name=error str name=tracejava.lang.NullPointerException at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:136) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) at org.eclipse.jetty.server.Server.handle(Server.java:351) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890) at
[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4620: --- Attachment: LUCENE-4620.patch Patch, fixing that bug Shai found. Performance is better with this specialization:
{noformat}
            Task    QPS base      StdDev    QPS comp      StdDev      Pct diff
        PKLookup      192.61      (4.5%)      193.06      (4.2%)      0.2% (  -8% -    9%)
         LowTerm       15.33      (1.6%)       15.44      (2.5%)      0.7% (  -3% -    4%)
         MedTerm        7.60      (0.7%)        7.74      (1.8%)      1.9% (   0% -    4%)
        HighTerm        3.85      (0.6%)        3.97      (1.2%)      3.1% (   1% -    4%)
{noformat}
I also tried the unrolling of the vInt loop but perf was strangely quite a bit worse.. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552978#comment-13552978 ] Michael McCandless commented on LUCENE-4620: I think we should just make a specialized accumulator/aggregator for the counts-only-dgap-vint case: that code wouldn't need to populate an IntsRef and then make a 2nd pass over the ords ... it'd just increment the count for each ord as it decodes. In previous issues I already tested that this gives a good gain ... Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer could be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
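For illustration, a sketch of the specialized counts-only-dgap-vint aggregation described in that comment, assuming a dense int[] counts array sized to the number of ordinals; the class and method names are hypothetical, not the eventual patch:
{code:java}
import org.apache.lucene.util.BytesRef;

// Sketch only: decode dgap+VInt ordinals straight out of the payload/DV bytes
// and bump the count per ordinal -- no intermediate IntsRef, no second pass.
public final class CountingDGapVIntAggregator {

  private final int[] counts;

  public CountingDGapVIntAggregator(int numOrdinals) {
    counts = new int[numOrdinals];
  }

  public void aggregate(BytesRef buf) {
    int upto = buf.offset;
    final int end = buf.offset + buf.length;
    int ord = 0;
    while (upto < end) {
      byte b = buf.bytes[upto++];
      int delta = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buf.bytes[upto++];
        delta |= (b & 0x7F) << shift;
      }
      ord += delta;   // undo the gap
      counts[ord]++;  // count while decoding
    }
  }

  public int count(int ordinal) {
    return counts[ordinal];
  }
}
{code}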
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552994#comment-13552994 ] Michael McCandless commented on LUCENE-4676: This explanation makes perfect sense! Thanks for digging, Simon. +1 to just use NoMergePolicy. IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Attachments: LUCENE-4676.patch Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: 4.1 branch
On Jan 14, 2013, at 1:33 PM, Steve Rowe sar...@gmail.com wrote: branches/lucene_solr_4_1/ is open for business! I'm going to change version strings in branch_4x from 4.1 to 4.2 now. Done. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553044#comment-13553044 ] Commit Tag Bot commented on LUCENE-4676: [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revisionrevision=1433079 LUCENE-4676: Use NoMergePolicy in starvation test to prevent buffered deletes pruning IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Attachments: LUCENE-4676.patch Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
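As a sketch of the fix's mechanics, disabling merges on a 4.x IndexWriter looks roughly like this; it is illustrative only, not the committed test change, and the singleton names are the 4.x ones (later releases expose NoMergePolicy.INSTANCE instead):
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch: disable merges entirely so background merging cannot apply or prune
// buffered deletes underneath the NRT reader the test is exercising.
public class NoMergePolicyExample {

  public static IndexWriter openWriter() throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
    iwc.setMergePolicy(NoMergePolicy.COMPOUND_FILES); // or NO_COMPOUND_FILES
    return new IndexWriter(dir, iwc);
  }
}
{code}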
[jira] [Commented] (SOLR-2592) Custom Hashing
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553058#comment-13553058 ] Commit Tag Bot commented on SOLR-2592: -- [trunk commit] Yonik Seeley http://svn.apache.org/viewvc?view=revisionrevision=1433082 SOLR-2592: changes entry for doc routing Custom Hashing -- Key: SOLR-2592 URL: https://issues.apache.org/jira/browse/SOLR-2592 Project: Solr Issue Type: New Feature Components: SolrCloud Affects Versions: 4.0-ALPHA Reporter: Noble Paul Assignee: Yonik Seeley Fix For: 4.1 Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, SOLR-2592_collectionProperties.patch, SOLR-2592.patch, SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value, etc.), it will be easy to narrow down the search to a smaller subset of shards and, in effect, achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2592) Custom Hashing
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved SOLR-2592. Resolution: Fixed Fix Version/s: 5.0 Custom Hashing -- Key: SOLR-2592 URL: https://issues.apache.org/jira/browse/SOLR-2592 Project: Solr Issue Type: New Feature Components: SolrCloud Affects Versions: 4.0-ALPHA Reporter: Noble Paul Assignee: Yonik Seeley Fix For: 4.1, 5.0 Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, SOLR-2592_collectionProperties.patch, SOLR-2592.patch, SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value, etc.), it will be easy to narrow down the search to a smaller subset of shards and, in effect, achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2894) Implement distributed pivot faceting
[ https://issues.apache.org/jira/browse/SOLR-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Russell updated SOLR-2894: Attachment: SOLR-2894.patch Corrected null aggregation issues when docs contain null values for the fields being pivoted on. Added logic to remove local params from the pivot query-string variables when determining over-request. Implement distributed pivot faceting Key: SOLR-2894 URL: https://issues.apache.org/jira/browse/SOLR-2894 Project: Solr Issue Type: Improvement Reporter: Erik Hatcher Fix For: 4.2, 5.0 Attachments: SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894-reworked.patch Following up on SOLR-792, pivot faceting currently only supports undistributed mode. Distributed pivot faceting needs to be implemented. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3858) Doc-to-shard assignment based on range property on shards
[ https://issues.apache.org/jira/browse/SOLR-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553064#comment-13553064 ] Yonik Seeley commented on SOLR-3858: SOLR-3755 took care of most of this, but the shard splitting code still needs to use the collection-specific doc router. Doc-to-shard assignment based on range property on shards --- Key: SOLR-3858 URL: https://issues.apache.org/jira/browse/SOLR-3858 Project: Solr Issue Type: Sub-task Reporter: Yonik Seeley Anything that maps a document id to a shard should consult the ranges defined on the shards (currently indexing and real-time get). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
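The doc-to-shard rule described here (hash the document id, then consult the per-shard hash ranges) can be sketched as follows; this is illustrative only and not Solr's actual DocRouter code, which uses MurmurHash3 rather than String.hashCode:
{code:java}
import java.util.List;

// Illustrative sketch of range-based doc routing: find the shard whose
// [min, max] hash range covers the hash of the document id.
public class HashRangeRouterSketch {

  public static class ShardRange {
    final String shardName;
    final int min;
    final int max;

    public ShardRange(String shardName, int min, int max) {
      this.shardName = shardName;
      this.min = min;
      this.max = max;
    }
  }

  public static String shardFor(String docId, List<ShardRange> ranges) {
    int hash = docId.hashCode(); // stand-in hash; Solr uses MurmurHash3 here
    for (ShardRange r : ranges) {
      if (hash >= r.min && hash <= r.max) {
        return r.shardName;
      }
    }
    throw new IllegalStateException("no shard range covers hash " + hash);
  }
}
{code}
Indexing, deletes, and real-time get would all go through the same lookup, so a document consistently lands on (and is fetched from) the shard that owns its hash.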
[jira] [Commented] (SOLR-2592) Custom Hashing
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553066#comment-13553066 ] Commit Tag Bot commented on SOLR-2592: -- [branch_4x commit] Yonik Seeley http://svn.apache.org/viewvc?view=revisionrevision=1433084 SOLR-2592: changes entry for doc routing Custom Hashing -- Key: SOLR-2592 URL: https://issues.apache.org/jira/browse/SOLR-2592 Project: Solr Issue Type: New Feature Components: SolrCloud Affects Versions: 4.0-ALPHA Reporter: Noble Paul Assignee: Yonik Seeley Fix For: 4.1, 5.0 Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, SOLR-2592_collectionProperties.patch, SOLR-2592.patch, SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value, etc.), it will be easy to narrow down the search to a smaller subset of shards and, in effect, achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org