Hudson build is back to normal : Solr-3.x #105
See https://hudson.apache.org/hudson/job/Solr-3.x/105/changes - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2646) Implement the Military Grid Reference System for tiling
[ https://issues.apache.org/jira/browse/LUCENE-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910029#action_12910029 ]

Lance Norskog commented on LUCENE-2646:
---------------------------------------

From the Wikipedia page: in the polar regions, a different convention is used. http://earth-info.nga.mil/GandG/publications/tm8358.1/tr83581f.html

> Implement the Military Grid Reference System for tiling
> -------------------------------------------------------
>                 Key: LUCENE-2646
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2646
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spatial
>            Reporter: Grant Ingersoll
>
> The current tile-based system in Lucene is broken. We should standardize on a common way of labeling grids and provide that as an option. Based on previous conversations with Ryan McKinley and Chris Male, the Military Grid Reference System (http://en.wikipedia.org/wiki/Military_grid_reference_system) seems a good candidate for the replacement, due to its standard use of metric tiles of increasing orders of magnitude (1, 10, 100, 1000, etc.).

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
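The nesting property that makes a metric grid like MGRS attractive for tiling can be sketched with a toy example: truncating metric coordinates at each tile size yields grid labels in which every coarser tile contains the finer ones. This is purely illustrative (the class and label format are invented, not part of any patch):

```java
// Hedged sketch: MGRS-style tiles at 1/10/100/1000 m can be labeled by
// truncating easting/northing to the tile size, so a coarser tile id is
// an "ancestor" of every finer tile inside it.
public class MetricTile {
    static String tileId(int eastingMeters, int northingMeters, int tileSizeMeters) {
        int e = eastingMeters / tileSizeMeters;   // integer division truncates
        int n = northingMeters / tileSizeMeters;
        return tileSizeMeters + "m:" + e + "," + n;
    }

    public static void main(String[] args) {
        // The same point falls into nested tiles of increasing size.
        for (int size : new int[] {1, 10, 100, 1000}) {
            System.out.println(tileId(4123, 9876, size));
        }
    }
}
```

Indexing a point under all of its tile labels then lets a search at any granularity match on a single term.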
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910058#action_12910058 ]

Varun Gupta commented on SOLR-236:
----------------------------------

I am using the SOLR-1682 patch committed on trunk for field collapsing. It works great, but causes problems when I include other components like Facet and Highlighter. Is there any workaround to use the Highlight and Facet components along with grouping?

> Field collapsing
> ----------------
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>            Assignee: Shalin Shekhar Mangar
>             Fix For: Next
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
> This patch includes a new feature called field collapsing: it collapses a group of results with a similar value for a given field into a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also duplicate detection: http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams): collapse.field to choose the field used to group results; collapse.type, normal (default value) or adjacent; collapse.max to select how many continuous results are allowed before collapsing.
> TODO (in progress): more documentation (in source code); test cases.
> Two patches: field_collapsing.patch for the current development version; field_collapsing_1.1.0.patch for Solr 1.1.0.
> P.S.: Feedback and misspelling corrections are welcome ;-)
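The collapse.type distinction in the description above can be made concrete with a toy sketch (invented code, not the patch's implementation): "adjacent" only collapses runs of consecutive results sharing a field value, keeping up to collapse.max hits per run, whereas "normal" would group equal values anywhere in the result list.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of "adjacent" collapsing: walk the ranked results in order
// and keep at most `max` consecutive hits with the same collapse-field value.
public class CollapseModes {
    static List<String> collapseAdjacent(List<String> fieldValues, int max) {
        List<String> out = new ArrayList<>();
        int run = 0;
        String prev = null;
        for (String v : fieldValues) {
            run = v.equals(prev) ? run + 1 : 1;   // length of the current run
            if (run <= max) out.add(v);           // allow up to collapse.max continuous hits
            prev = v;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hits = List.of("siteA", "siteA", "siteA", "siteB", "siteA");
        // Adjacent mode keeps the trailing siteA because it is not contiguous
        // with the first run; "normal" mode would have collapsed it too.
        System.out.println(collapseAdjacent(hits, 1)); // [siteA, siteB, siteA]
    }
}
```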
[jira] Updated: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
[ https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2647:
---------------------------------------
    Attachment: LUCENE-2647.patch

> Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
> ---------------------------------------------------------------------------------------
>                 Key: LUCENE-2647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2647
[jira] Created: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
---------------------------------------------------------------------------------------

                 Key: LUCENE-2647
                 URL: https://issues.apache.org/jira/browse/LUCENE-2647
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 4.0
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 4.0
         Attachments: LUCENE-2647.patch

The terms dict components that currently live under the Standard codec (oal.index.codecs.standard.*) are in fact very generic, and in no way particular to the Standard codec. Already we have many other codecs (sep, fixed int block, var int block, pulsing, appending) that re-use the terms dict writer/reader components. So I'd like to move these out into oal.index.codecs, and rename them:

* StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
* StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
* SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
* StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
* StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader

With this move we have a nice reusable terms dict impl. The terms index impl is still well decoupled, so e.g. we could [in theory] explore a variable-gap terms index. Many codecs, I expect, don't need/want to implement their own terms dict. There are no code/index format changes here, besides the renaming and fixing all imports/usages of the renamed classes.
Re: Build failed in Hudson: Lucene-3.x #116
On Wed, Sep 15, 2010 at 8:43 PM, Robert Muir <rcm...@gmail.com> wrote:
> I wonder if, now that we vary these in the tests anyway, we should consider commenting out the Localized/MultiCodec runners? We could keep them available (but not used) in case you want to quickly run a test under every single Locale/Codec.

+1

Mike
[jira] Commented: (LUCENE-2588) terms index should not store useless suffixes
[ https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910072#action_12910072 ]

Michael McCandless commented on LUCENE-2588:
--------------------------------------------

After I commit the simple renaming of the standard codec's terms dicts (LUCENE-2647), I plan to make this suffix-stripping opto private to StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method that can alter the indexed term before it's written). Since StandardCodec hardwires the term sort to unicode order, the opto is safe there.

In general, if a codec uses a different term sort (such as this test's codec), it's conceivable a different opto could apply. EG I think this test could prune the suffix based on the term after the index term. But it makes no sense to spend time exploring this until a real use case arrives... this is just a simple test to assert that a codec is in fact free to customize the sort order.

Also, there are other fun optos we could explore w/ the terms index. EG we could wiggle the index term selection a bit, so it wouldn't be fixed to every N, to try to find terms that are small after removing the useless suffix. Separately, we could choose index terms according to docFreq -- eg one simple policy would be to plant an index term on term X if either 1) term X's docFreq is over a threshold, or 2) it's been N terms since the last indexed term. This could be a powerful way to even further reduce RAM usage of the terms index, because it'd ensure that high-cost terms (ie, many docs/freqs/positions to visit) are in fact fast to look up. The low-freq terms can afford a higher seek time since it'll be so fast to enum the docs.

> terms index should not store useless suffixes
> ---------------------------------------------
>                 Key: LUCENE-2588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2588
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2588.patch, LUCENE-2588.patch
>
> This idea came up when discussing w/ Robert how to improve our terms index... The terms dict index today simply grabs whatever term was at a 0 mod 128 index (by default). But this is wasteful because you often don't need the suffix of the term at that point. EG if the 127th term is "aa" and the 128th (indexed) term is "abcd123456789", instead of storing that full term you only need to store "ab". The suffix is useless, and uses up RAM since we load the terms index into RAM.
> The patch is very simple. The optimization is particularly easy because terms are now byte[] and we sort in binary order. I tested on the first 10M 1-KB Wikipedia docs, and this reduces the terms index (tii) file from 3.9 MB to 3.3 MB = 16% smaller (using StandardAnalyzer, indexing the body field tokenized but the title / date fields untokenized). I expect on noisier terms dicts, especially ones w/ bad terms accidentally indexed, that the savings will be even more.
> In the future we could do crazier things. EG there's no real reason why the indexed terms must be regular (every N terms), so we could instead pick terms more carefully, say approximately every N, but favor terms that have a smaller net prefix. We can also index more sparsely in regions where the net docFreq is lowish, since we can afford somewhat higher seek+scan time to these terms since enuming their docs will be much faster.
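The suffix-trimming optimization this issue describes can be sketched as follows (illustrative code, not the actual patch): since terms are byte[] sorted in binary order, an indexed term only needs enough leading bytes to still sort strictly after the previous term.

```java
import java.nio.charset.StandardCharsets;

// Hedged sketch of the LUCENE-2588 idea (class and method names invented):
// keep only the shortest prefix of the indexed term that remains strictly
// greater than the preceding term in binary order.
public class IndexTermTrim {
    static byte[] shortestDistinguishingPrefix(byte[] prev, byte[] indexed) {
        int limit = Math.min(prev.length, indexed.length);
        int i = 0;
        while (i < limit && prev[i] == indexed[i]) {
            i++;                                  // length of the shared prefix
        }
        // Keep the shared prefix plus one more byte to stay > prev.
        int keep = Math.min(i + 1, indexed.length);
        byte[] result = new byte[keep];
        System.arraycopy(indexed, 0, result, 0, keep);
        return result;
    }

    public static void main(String[] args) {
        byte[] prev = "aa".getBytes(StandardCharsets.UTF_8);
        byte[] indexed = "abcd123456789".getBytes(StandardCharsets.UTF_8);
        // Matches the issue's example: only "ab" needs to be stored.
        System.out.println(new String(
            shortestDistinguishingPrefix(prev, indexed), StandardCharsets.UTF_8));
    }
}
```

The trimmed term still routes a binary search correctly, because any term >= "ab" sorts after "aa" regardless of the dropped suffix.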
Re: Build failed in Hudson: Lucene-3.x #116
On Thu, Sep 16, 2010 at 11:56 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> On Wed, Sep 15, 2010 at 8:43 PM, Robert Muir <rcm...@gmail.com> wrote:
>> I wonder if, now that we vary these in the tests anyway, we should consider commenting out the Localized/MultiCodec runners? We could keep them available (but not used) in case you want to quickly run a test under every single Locale/Codec.
>
> +1
>
> Mike

Yep +1
[jira] Commented: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
[ https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910081#action_12910081 ]

Simon Willnauer commented on LUCENE-2647:
-----------------------------------------

Mike, I think renaming is a good idea - that might make things slightly easier for folks who want to play around with codecs. Here are some comments on the naming:

bq. StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader

+1

bq. StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader

What about TermsIndexWriter/ReaderBase, since we started using that scheme with analyzers and the JDK uses it too? If we remove the abstractness one day, the Abstract* name becomes very misleading, but the property of being a base class will likely remain.

bq. SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader

I really don't like Simple* - it's like Smart*, which makes me immediately feel itchy all over the place. What differentiates this from the others? It is the default? Maybe DefaultTermsIndexWriter/Reader?

bq. StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader

Again, what about PostingsWriter/ReaderBase?

bq. StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader

+1

> Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
> ---------------------------------------------------------------------------------------
>                 Key: LUCENE-2647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2647
[jira] Commented: (LUCENE-2588) terms index should not store useless suffixes
[ https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910084#action_12910084 ]

Simon Willnauer commented on LUCENE-2588:
-----------------------------------------

{quote}
After I commit the simple renaming of the standard codec's terms dicts (LUCENE-2647), I plan to make this suffix-stripping opto private to StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method that can alter the indexed term before it's written).
{quote}

Mike, what about factoring out a method like

{code}
protected short indexTermPrefixLen(BytesRef lastTerm, BytesRef currentTerm) {
  ...
}
{code}

Then we can simply override that method if there is a comparator which cannot utilize / breaks this opto?

> terms index should not store useless suffixes
> ---------------------------------------------
>                 Key: LUCENE-2588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2588
[jira] Commented: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
[ https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910089#action_12910089 ]

Michael McCandless commented on LUCENE-2647:
--------------------------------------------

bq. What about TermsIndexWriter/ReaderBase since we started using that scheme with analyzers and the JDK uses that too.

OK, I'll switch from Abstract* to *Base.

{quote}
bq. SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
I really don't like Simple* - it's like Smart*, which makes me immediately feel itchy all over the place.
{quote}

Heh, OK.

bq. What differentiates this from the others? It is the default? Maybe DefaultTermsIndexWriter/Reader?

Well... there are no others yet! So it is the default for now, but I don't like baking that into its name... Lesse... this one uses packed ints to write the RAM image required at search time, so that at search time we just slurp in these pre-built images. While the index term selection policy is now fixed (every N), I think this may change with time (the policy should be easily separable from how the index terms are written). Though, since we haven't yet done that separation, maybe I simply name it FixedGapTermsIndexWriter/Reader? How's that?

> Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
> ---------------------------------------------------------------------------------------
>                 Key: LUCENE-2647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2647
[jira] Commented: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
[ https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910095#action_12910095 ]

Simon Willnauer commented on LUCENE-2647:
-----------------------------------------

bq. ...FixedGapTermsIndexWriter/Reader? How's that?

+1

> Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard
> ---------------------------------------------------------------------------------------
>                 Key: LUCENE-2647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2647
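The "fixed gap" policy behind the agreed-on FixedGapTermsIndexWriter/Reader name can be shown with a toy sketch (invented code; the real writer also emits packed-int RAM images, which this omits): every Nth term in sorted order becomes an index term.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of a fixed-gap index term selection policy: terms at
// positions 0 mod `gap` are promoted into the in-RAM terms index.
public class FixedGapSelection {
    static List<String> selectIndexTerms(List<String> sortedTerms, int gap) {
        List<String> indexTerms = new ArrayList<>();
        for (int i = 0; i < sortedTerms.size(); i++) {
            if (i % gap == 0) {
                indexTerms.add(sortedTerms.get(i));
            }
        }
        return indexTerms;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("a", "ab", "abc", "b", "ba", "bb", "c");
        // With gap=3, positions 0, 3, 6 are indexed.
        System.out.println(selectIndexTerms(terms, 3)); // [a, b, c]
    }
}
```

Decoupling this policy from how index terms are written is exactly what would let a variable-gap policy be swapped in later.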
[jira] Commented: (SOLR-792) Tree Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910116#action_12910116 ]

Yonik Seeley commented on SOLR-792:
-----------------------------------

1.4.x is for bugfixes only.

> Tree Faceting Component
> -----------------------
>                 Key: SOLR-792
>                 URL: https://issues.apache.org/jira/browse/SOLR-792
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Erik Hatcher
>            Assignee: Ryan McKinley
>            Priority: Minor
>         Attachments: SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch
>
> A component to do multi-level faceting.
[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations
[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910123#action_12910123 ]

Michael McCandless commented on LUCENE-2575:
--------------------------------------------

bq. I'm not immediately sure what's reading the level at this end position of the byte[].

This is so that once we exhaust the slice and must allocate the next one, we know what size (level + 1, ceiling'd) to make the next slice.

> Concurrent byte and int block implementations
> ---------------------------------------------
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
> The current *BlockPool implementations aren't quite concurrent. We really need something that has a locking flush method, where flush is called at the end of adding a document. Once flushed, the newly written data would be available to all other reading threads (ie, postings etc).
> I'm not sure I understand the slices concept; it seems like it'd be easier to implement a seekable random-access-file-like API. One would seek to a given position, then read or write from there. The underlying management of byte arrays could then be hidden?
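The level byte Mike explains can be sketched like this (sizes and code are illustrative, not Lucene's actual ByteBlockPool constants): the sentinel byte at the end of a slice records its level, so a writer that runs off the end knows how large to make the next slice - each level is bigger, up to a ceiling.

```java
// Hedged sketch of interleaved slices with growing sizes: SIZE gives the
// byte length of each level, NEXT maps a level to the next (ceiling'd)
// level. The last byte of a slice stores its level, which is exactly
// what the writer reads when it exhausts the slice.
public class SliceLevels {
    static final int[] SIZE = {5, 14, 20, 30, 40};   // invented per-level sizes
    static final int[] NEXT = {1, 2, 3, 4, 4};       // level -> next level, capped

    // Given the level byte found at the end of a full slice, return the
    // size to allocate for its successor.
    static int nextSliceSize(byte levelByteAtEndOfSlice) {
        return SIZE[NEXT[levelByteAtEndOfSlice]];
    }

    public static void main(String[] args) {
        int level = 0;
        // Successive slices for one posting grow, then plateau at the ceiling.
        for (int i = 0; i < 6; i++) {
            System.out.print(SIZE[level] + " ");     // 5 14 20 30 40 40
            level = NEXT[level];
        }
    }
}
```

Small first slices keep rare terms cheap, while the growing sizes amortize allocation for frequent terms.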
Re: Build failed in Hudson: Lucene-3.x #116
Ok, I will create an issue. For starters we could comment out these runners (for example, if code does not work for a locale, it will fail 'eventually' anyway, due to the fact that we pick a random one). In the future maybe we could make this functionality easily triggerable with a -D, in case you want to run a test class/method/entire test suite under all Locales/Codecs.

On Thu, Sep 16, 2010 at 6:23 AM, Simon Willnauer <simon.willna...@googlemail.com> wrote:
> On Thu, Sep 16, 2010 at 11:56 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>> On Wed, Sep 15, 2010 at 8:43 PM, Robert Muir <rcm...@gmail.com> wrote:
>>> I wonder if, now that we vary these in the tests anyway, we should consider commenting out the Localized/MultiCodec runners? We could keep them available (but not used) in case you want to quickly run a test under every single Locale/Codec.
>>
>> +1
>>
>> Mike
>
> Yep +1

--
Robert Muir
rcm...@gmail.com
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910131#action_12910131 ]

Yonik Seeley commented on SOLR-236:
-----------------------------------

bq. It works great but gives problem when I include other components like Facet and Highlighter.

See the list of sub-tasks on this issue starting with "SearchGrouping:". I fixed faceting yesterday, and I hope to fix highlighting and debugging today.

> Field collapsing
> ----------------
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
[jira] Commented: (LUCENE-2588) terms index should not store useless suffixes
[ https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910140#action_12910140 ] Robert Muir commented on LUCENE-2588: - {quote} Also, there are other fun optos we could explore w/ terms index. EG we could wiggle the index term selection a bit, so it wouldn't be fixed to every N, to try to find terms that are small after removing the useless suffix. Separately, we could choose index terms according to docFreq - eg one simple policy would be to plant an index term on term X if either 1) term X's docFreq is over a threshold, or, 2) it's been N terms since the last indexed terms. This could be a powerful way to even further reduce RAM usage of the terms index, because it'd ensure that high cost terms (ie, many docs/freqs/positions to visit) are in fact fast to lookup. The low freq terms can afford a higher seek time since it'll be so fast to enum the docs. {quote} it would be great to come up with a heuristic that balances all 3 of these: Because selecting % 32 is silly if it would give you abracadabra when the previous term is a and a fudge would give you a smaller index term (of course it depends too, on what the next index term would be, and the docfreq optimization too). It sounds tricky, but right now we are just selecting index terms with no basis at all (essentially random). then we are trying to deal with bad selections by trimming wasted suffixes, etc. terms index should not store useless suffixes - Key: LUCENE-2588 URL: https://issues.apache.org/jira/browse/LUCENE-2588 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2588.patch, LUCENE-2588.patch This idea came up when discussing w/ Robert how to improve our terms index... The terms dict index today simply grabs whatever term was at a 0 mod 128 index (by default). 
But this is wasteful because you often don't need the suffix of the term at that point. EG if the 127th term is "aa" and the 128th (indexed) term is "abcd123456789", instead of storing that full term you only need to store "ab". The suffix is useless, and uses up RAM since we load the terms index into RAM. The patch is very simple. The optimization is particularly easy because terms are now byte[] and we sort in binary order. I tested on the first 10M 1KB Wikipedia docs, and this reduces the terms index (tii) file from 3.9 MB to 3.3 MB, i.e. 16% smaller (using StandardAnalyzer, indexing the body field tokenized but title / date fields untokenized). I expect on noisier terms dicts, especially ones w/ bad terms accidentally indexed, that the savings will be even greater. In the future we could do crazier things. EG there's no real reason why the indexed terms must be regular (every N terms), so we could instead pick terms more carefully, say approximately every N, but favoring terms that have a smaller net prefix. We could also index more sparsely in regions where the net docFreq is lowish, since we can afford somewhat higher seek+scan time to these terms given that enuming their docs will be much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
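The suffix-trimming idea above can be sketched in plain Java: because terms sort in binary order, the terms index only needs the shortest prefix of an indexed term that still sorts strictly after the previous term; the rest of the suffix can be dropped. This is an illustrative sketch, not the actual Lucene patch, and the class and method names are hypothetical.

```java
public class IndexTermPrefix {
    // Returns the minimal distinguishing prefix of `indexedTerm`, given the
    // term that immediately precedes it in sorted order.
    static String minimalPrefix(String prevTerm, String indexedTerm) {
        int limit = Math.min(prevTerm.length(), indexedTerm.length());
        int i = 0;
        while (i < limit && prevTerm.charAt(i) == indexedTerm.charAt(i)) {
            i++; // length of the shared prefix
        }
        // One character past the shared prefix distinguishes the two terms.
        return indexedTerm.substring(0, Math.min(i + 1, indexedTerm.length()));
    }

    public static void main(String[] args) {
        // The example from the issue: previous term "aa", indexed term
        // "abcd123456789" -- only "ab" needs to be stored.
        System.out.println(minimalPrefix("aa", "abcd123456789")); // prints ab
    }
}
```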
[jira] Commented: (LUCENE-2646) Implement the Military Grid Reference System for tiling
[ https://issues.apache.org/jira/browse/LUCENE-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910157#action_12910157 ] David Smiley commented on LUCENE-2646: -- I hope you guys can make it to my session at LuceneRevolution, at which I'll describe my geohash prefix filtering technique. I'm working on an open-source contribution, but the public release process is slow at MITRE. I'm not yet employing a tiling technique, but that's where I want to go. Implement the Military Grid Reference System for tiling Key: LUCENE-2646 URL: https://issues.apache.org/jira/browse/LUCENE-2646 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial Reporter: Grant Ingersoll The current tile-based system in Lucene is broken. We should standardize on a common way of labeling grids and provide that as an option. Based on previous conversations with Ryan McKinley and Chris Male, it seems the Military Grid Reference System (http://en.wikipedia.org/wiki/Military_grid_reference_system) is a good candidate for the replacement due to its standard use of metric tiles of increasing orders of magnitude (1, 10, 100, 1000, etc.)
[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations
[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910201#action_12910201 ] Jason Rutherglen commented on LUCENE-2575: -- bq. we know what size (level + 1, ceiling'd) to make the next slice. Thanks. In the midst of debugging last night I realized this. The next question is whether to remove it. Concurrent byte and int block implementations - Key: LUCENE-2575 URL: https://issues.apache.org/jira/browse/LUCENE-2575 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Realtime Branch Reporter: Jason Rutherglen Fix For: Realtime Branch Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch The current *BlockPool implementations aren't quite concurrent. We really need something that has a locking flush method, where flush is called at the end of adding a document. Once flushed, the newly written data would be available to all other reading threads (ie, postings etc). I'm not sure I understand the slices concept; it seems like it'd be easier to implement a seekable, random-access-file-like API. One would seek to a given position, then read or write from there. The underlying management of byte arrays could then be hidden?
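The seekable random-access API suggested in the issue description could look roughly like the sketch below, assuming the pool hands out fixed-size byte[] blocks and a position maps to a (block, offset) pair. Class and method names are illustrative, not the Lucene internals.

```java
import java.util.ArrayList;
import java.util.List;

public class PagedBytes {
    private static final int BLOCK_SIZE = 1 << 15; // 32 KB blocks
    private final List<byte[]> blocks = new ArrayList<byte[]>();
    private long pos;

    public void seek(long position) { pos = position; }

    public void writeByte(byte b) {
        int block = (int) (pos / BLOCK_SIZE);
        int offset = (int) (pos % BLOCK_SIZE);
        while (blocks.size() <= block) {
            blocks.add(new byte[BLOCK_SIZE]); // grow lazily, hidden from caller
        }
        blocks.get(block)[offset] = b;
        pos++;
    }

    public byte readByte() {
        byte b = blocks.get((int) (pos / BLOCK_SIZE))[(int) (pos % BLOCK_SIZE)];
        pos++;
        return b;
    }

    public static void main(String[] args) {
        PagedBytes pool = new PagedBytes();
        pool.seek(40000);              // lands in the second 32 KB block
        pool.writeByte((byte) 42);
        pool.seek(40000);
        System.out.println(pool.readByte()); // prints 42
    }
}
```

The point of the sketch is that callers see only seek/read/write, while block management stays internal, which is what the comment asks about.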
[jira] Created: (LUCENE-2648) Allow PackedInts.ReaderIterator to advance more than one value
Allow PackedInts.ReaderIterator to advance more than one value -- Key: LUCENE-2648 URL: https://issues.apache.org/jira/browse/LUCENE-2648 Project: Lucene - Java Issue Type: Improvement Components: Other Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor The iterator-like API in LUCENE-2186 makes effective use of PackedInts.ReaderIterator but frequently skips multiple values. ReaderIterator currently requires looping over ReaderIterator#next() to advance to a certain value. We should allow ReaderIterator to expose an #advance(ord) method to make use cases like that more efficient. This issue is part of my effort to make LUCENE-2186 smaller by breaking it up into little issues for parts that can be generally useful.
[jira] Updated: (LUCENE-2648) Allow PackedInts.ReaderIterator to advance more than one value
[ https://issues.apache.org/jira/browse/LUCENE-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-2648: Attachment: LUCENE-2648.patch Here is a patch - comments welcome.
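The proposed #advance(ord) can be sketched as follows. The naive version below just loops next(); a real implementation could seek directly into the packed block instead. All names are hypothetical, not the committed Lucene API.

```java
// Stand-in for PackedInts.ReaderIterator: walks values in ordinal order.
public class ReaderIteratorSketch {
    private final long[] values; // stands in for the packed backing store
    private int ord = -1;        // ordinal of the most recently returned value

    public ReaderIteratorSketch(long[] values) { this.values = values; }

    public long next() { return values[++ord]; }

    public int ord() { return ord; }

    // Advance so the returned value is the one at ordinal `targetOrd`
    // (assumes targetOrd is at or past the current position).
    public long advance(int targetOrd) {
        long value = 0;
        while (ord < targetOrd) {
            value = next();
        }
        return value;
    }

    public static void main(String[] args) {
        ReaderIteratorSketch it = new ReaderIteratorSketch(new long[]{5, 7, 9, 11});
        System.out.println(it.advance(2)); // prints 9
    }
}
```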
Re: trie* fields and sortMissingLast?
On Thu, Sep 16, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote: (i changed the subject to see if Uwe perks up) Is it possible to change the FieldCache for Trie* fields so that it knows what fields are missing? or is there something about the Trie structure that makes that impossible. Nope - it is trivial to record that while the entry is being built for all of the current FieldCache entry types - it's just not currently done. After it is recorded (via a bitset most likely), it needs to be exposed via an API. It would be great to be able to deprecate sint,slong,sfloat,sdouble +1 -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910262#action_12910262 ] Michael McCandless commented on LUCENE-2324: Is this near-committable? Ie just the DWPT cutover? This part seems separable from making each DWPT's buffer searchable? I'm running some tests w/ 20 indexing threads and I think the sync'd flush is a big bottleneck... Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO and CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910268#action_12910268 ] Jason Rutherglen commented on LUCENE-2324: -- bq. I think the sync'd flush is a big bottleneck Is this because indexing stops while the DWPT segment is being flushed to disk, or are you referring to a different sync?
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910276#action_12910276 ] Michael McCandless commented on LUCENE-2324: bq. Is this because indexing stops while the DWPT segment is being flushed to disk or are you referring to a different sync? I'm talking about Lucene trunk today (ie before this patch). Yes, because indexing of all 20 threads is blocked while a single thread moves the RAM buffer to disk. But, with this patch, each thread will privately move its own RAM buffer to disk, not blocking the rest. With 20 threads I'm seeing ~4 seconds of concurrent indexing and then 6-8 seconds to flush (w/ 256 MB RAM buffer).
Re: trie* fields and sortMissingLast?
On Thu, Sep 16, 2010 at 11:28 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Sep 16, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote: (i changed the subject to see if Uwe perks up) Is it possible to change the FieldCache for Trie* fields so that it knows what fields are missing? or is there something about the Trie structure that makes that impossible. Nope - it is trivial to record that while the entry is being built for all of the current FieldCache entry types - it's just not currently done. After it is recorded (via a bitset most likely), it needs to be exposed via an API. Looking at the FieldCache (first time ever), I'm not sure I see an obvious place to augment the cache with a BitSet for the matching docs. We could add a function to the FieldCache like: public BitSet getMatchingDocs(IndexReader reader, String field) That would cache the matching docs for a field, however that means you would have to traverse the terms twice. The existing API for caching values stores the values (short[], int[], etc) not the Entry, so augmenting the cached Entry with a BitSet would get lost. It seems that this could be done, but would require some rejiggering of the API. The API could return an object like: class ByteValues { byte[] values; BitSet valid; } public ByteValues getBytes(IndexReader reader, String field) Another option (just brainstorming) would be to set the arrays to a special value to say they are 'missing', for example Integer.MIN_VALUE. The downside of this is that we lose one valid value in the range. For int, double, float, this may be OK, but for byte and short this is a pretty big tradeoff. Ideas for what may be a good path forward?
[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries
[ https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910287#action_12910287 ] Mark Bennett commented on SOLR-1852: I realize this is closed, but I found a workaround for those who are still working with a pre-fix version. Just put the stopwords filter after the Word Delimiter filter. That worked for us without impacting much else, until we can get over to the new version. enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries - Key: SOLR-1852 URL: https://issues.apache.org/jira/browse/SOLR-1852 Project: Solr Issue Type: Bug Affects Versions: 1.4 Reporter: Peter Wolanin Assignee: Robert Muir Fix For: 1.4.1 Attachments: SOLR-1852.patch, SOLR-1852_solr14branch.patch, SOLR-1852_testcase.patch Symptom: searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. test string: Identi.ca queries that fail: IdentiCa, Identi.ca, Identi-ca query that matches: Identi ca schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34content-type=text%2Fplainview=copathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png Whether or not the bug appears is determined by the surrounding text: the block "would be great to have support for Identi.ca" fails to match Identi.ca, but putting the content on its own or in another sentence ("Support Identi.ca") makes the search match.
Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up using the word delimiter filter. Setting enablePositionIncrements=false in the stop filter and reindexing causes the searches to match. According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, either due to the upgraded Lucene or changes to the WordDelimiterFactory
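Mark Bennett's workaround above amounts to reordering the analyzer chain in schema.xml so the stop filter runs after the word delimiter filter. A hedged sketch of what that could look like follows; the filter classes are the standard Solr factories, but the tokenizer choice and attribute values are illustrative, not the reporter's actual schema.

```xml
<!-- Sketch of the workaround: StopFilterFactory moved after
     WordDelimiterFilterFactory; attribute values are illustrative. -->
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
          ignoreCase="true" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With this ordering, the stop word no longer sits immediately before the delimiter-split tokens at analysis time, which is the situation that triggered the phrase-query position gap.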
[jira] Resolved: (SOLR-2064) Search Grouping: support highlighting
[ https://issues.apache.org/jira/browse/SOLR-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved SOLR-2064. Fix Version/s: 4.0 Resolution: Fixed Fix committed. Search Grouping: support highlighting - Key: SOLR-2064 URL: https://issues.apache.org/jira/browse/SOLR-2064 Project: Solr Issue Type: Sub-task Reporter: Yonik Seeley Fix For: 4.0 Highlighting should be supported regardless of where the documents occur in a response, and regardless of the format (grouped, standard, etc).
RE: trie* fields and sortMissingLast?
Hi, is there already an issue open for the Bits interface in parallel to the native-type arrays in FieldCache? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Thursday, September 16, 2010 11:28 AM To: dev@lucene.apache.org Subject: Re: trie* fields and sortMissingLast? On Thu, Sep 16, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote: (i changed the subject to see if Uwe perks up) Is it possible to change the FieldCache for Trie* fields so that it knows what fields are missing? or is there something about the Trie structure that makes that impossible. Nope - it is trivial to record that while the entry is being built for all of the current FieldCache entry types - it's just not currently done. After it is recorded (via a bitset most likely), it needs to be exposed via an API. It would be great to be able to deprecate sint,slong,sfloat,sdouble +1 -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
RE: trie* fields and sortMissingLast?
It seems that this could be done, but would require some rejiggering of the API. The API could return an object like: class ByteValues { byte[] values; BitSet valid; } public ByteValues getBytes(IndexReader reader, String field) That's the plan for how to do it. Just replace BitSet with the Bits interface (which is available in trunk). Bits is also implemented by OpenBitSet, so the cache can be backed by an OpenBitSet. You only have to traverse the terms once: start with an empty Bits and place a mark on each document that has a value assigned.
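The one-pass plan described here can be sketched in plain Java. In this sketch java.util.BitSet stands in for Lucene's OpenBitSet/Bits, and a Byte[] input (null meaning "no value") models walking the field's terms; all names are illustrative, not the eventual FieldCache API.

```java
import java.util.BitSet;

public class ByteValuesSketch {
    static final class ByteValues {
        final byte[] values;
        final BitSet valid; // one bit per doc that has a value

        ByteValues(byte[] values, BitSet valid) {
            this.values = values;
            this.valid = valid;
        }
    }

    // Fill values and mark valid docs in a single pass over the input.
    static ByteValues fill(Byte[] docValues) {
        byte[] values = new byte[docValues.length];
        BitSet valid = new BitSet(docValues.length);
        for (int doc = 0; doc < docValues.length; doc++) {
            if (docValues[doc] != null) { // mark while filling: one pass only
                values[doc] = docValues[doc];
                valid.set(doc);
            }
        }
        return new ByteValues(values, valid);
    }

    public static void main(String[] args) {
        ByteValues v = fill(new Byte[]{3, null, 7}); // doc 1 has no value
        System.out.println(v.valid.get(0) + " " + v.valid.get(1)); // prints true false
    }
}
```

Callers can then distinguish "value is 0" from "no value" by consulting valid, which is exactly what the plain primitive arrays cannot express.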
[jira] Created: (LUCENE-2649) FieldCache should include a BitSet for matching docs
FieldCache should include a BitSet for matching docs Key: LUCENE-2649 URL: https://issues.apache.org/jira/browse/LUCENE-2649 Project: Lucene - Java Issue Type: Improvement Reporter: Ryan McKinley Fix For: 4.0 The FieldCache returns an array representing the values for each doc. However there is no way to know if the doc actually has a value. This should be changed to return an object representing the values *and* a BitSet for all valid docs.
[jira] Commented: (LUCENE-2649) FieldCache should include a BitSet for matching docs
[ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910446#action_12910446 ] Ryan McKinley commented on LUCENE-2649: --- See some discussion here: http://search.lucidimagination.com/search/document/b6a531f7b73621f1/trie_fields_and_sortmissinglast
[jira] Updated: (LUCENE-2649) FieldCache should include a BitSet for matching docs
[ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McKinley updated LUCENE-2649: -- Attachment: LUCENE-2649-FieldCacheWithBitSet.patch This patch replaces the cached primitive[] with a CachedObject. The object hierarchy looks like this: {code:java} public abstract static class CachedObject { } public abstract static class CachedArray extends CachedObject { public final Bits valid; public CachedArray( Bits valid ) { this.valid = valid; } }; public static final class ByteValues extends CachedArray { public final byte[] values; public ByteValues( byte[] values, Bits valid ) { super( valid ); this.values = values; } }; ... {code} Then it @deprecates the getBytes() methods and replaces them with getByteValues(): {code:java} public ByteValues getByteValues(IndexReader reader, String field) throws IOException; public ByteValues getByteValues(IndexReader reader, String field, ByteParser parser) throws IOException; {code} Then repeat for all the other types! All tests pass with this patch, but I have not added any tests for the BitSet (yet). If people like the general look of this approach, I will clean it up and add some tests, javadoc cleanup etc.
Re: trie* fields and sortMissingLast?
I could not find anything similar in JIRA, and went ahead and implemented: https://issues.apache.org/jira/browse/LUCENE-2649 On Thu, Sep 16, 2010 at 5:21 PM, Uwe Schindler u...@thetaphi.de wrote: It seems that this could be done, but would require some rejiggering of the API. The API could return an object like: class ByteValues { byte[] values; BitSet valid; } public ByteValues getBytes(IndexReader reader, String field) That's the plan for how to do it. Just replace BitSet with the Bits interface (which is available in trunk). Bits is also implemented by OpenBitSet, so the cache can be backed by an OpenBitSet. You only have to traverse the terms once: start with an empty Bits and place a mark on each document that has a value assigned.
[jira] Updated: (LUCENE-2649) FieldCache should include a BitSet for matching docs
[ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McKinley updated LUCENE-2649: -- Attachment: LUCENE-2649-FieldCacheWithBitSet.patch A slightly simplified version.
[jira] Commented: (LUCENE-2649) FieldCache should include a BitSet for matching docs
[ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910461#action_12910461 ] Uwe Schindler commented on LUCENE-2649: --- That looks exactly like I proposed it! The only thing: for DocTerms the approach is not needed - you can check for null, so the Bits interface is not needed there. As the OpenBitSets are created with the exact size and don't need to grow, you can use fastSet to speed up creation by doing no bounds checks.
[jira] Commented: (LUCENE-2649) FieldCache should include a BitSet for matching docs
[ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910464#action_12910464 ] Uwe Schindler commented on LUCENE-2649: --- When this is committed, we can also improve some Lucene parts: FieldCacheRangeFilter does not need to do extra deletion checks and can instead use the Bits interface to find missing/non-valued documents. Lucene's sorting Collectors can be improved to have consistent behaviour for missing values (like Solr's sortMissingFirst/Last).