[jira] [Commented] (SOLR-2242) Get distinct count of names for a facet field
[ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291803#comment-13291803 ]

Jason Rutherglen commented on SOLR-2242:

Terrance, can you post a patch to the Jira? It makes sense to start this Jira off non-distributed, and add a distributed version in another Jira issue...

Get distinct count of names for a facet field

Key: SOLR-2242
URL: https://issues.apache.org/jira/browse/SOLR-2242
Project: Solr
Issue Type: New Feature
Components: Response Writers
Affects Versions: 4.0
Reporter: Bill Bell
Priority: Minor
Fix For: 4.0
Attachments: SOLR-2242-3x.patch, SOLR-2242-3x_5_tests.patch, SOLR-2242-solr40-3.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch

When returning facet.field=name of field you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1. The feature is called namedistinct. Here is an example:

http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=2&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=0&facet.limit=-1&facet.field=price
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numFacetTerms=1&facet.limit=-1&facet.field=price

This currently only works on facet.field.
{code}
<lst name="facet_fields">
  <lst name="price">
    <int name="numFacetTerms">14</int>
    <int name="0.0">3</int>
    <int name="11.5">1</int>
    <int name="19.95">1</int>
    <int name="74.99">1</int>
    <int name="92.0">1</int>
    <int name="179.99">1</int>
    <int name="185.0">1</int>
    <int name="279.95">1</int>
    <int name="329.95">1</int>
    <int name="350.0">1</int>
    <int name="399.0">1</int>
    <int name="479.95">1</int>
    <int name="649.99">1</int>
    <int name="2199.0">1</int>
  </lst>
</lst>
{code}

Several people use this to get the group.field count (the # of groups).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
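As a rough illustration of what the numFacetTerms value in the response above represents, the following sketch counts the distinct terms whose count meets a minimum. It is not the patch's code: the class `FacetTermCounter` and both method names are hypothetical, and it works on an in-memory list of field values rather than on a Solr index.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr: illustrates "distinct count of facet terms".
final class FacetTermCounter {

    /** Tallies how often each value occurs, preserving first-seen order. */
    static Map<String, Integer> countTerms(Iterable<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the number of distinct terms whose count is at least mincount. */
    static int numFacetTerms(Map<String, Integer> termCounts, int mincount) {
        int distinct = 0;
        for (int count : termCounts.values()) {
            if (count >= mincount) {
                distinct++;
            }
        }
        return distinct;
    }
}
```

With mincount=1 every term that occurs at all is counted, which is why the description suggests combining the feature with facet.mincount=1 and facet.limit=-1.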
[jira] [Resolved] (SOLR-2569) Enable facile moving of cores
[ https://issues.apache.org/jira/browse/SOLR-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen resolved SOLR-2569.

Resolution: Won't Fix

Enable facile moving of cores

Key: SOLR-2569
URL: https://issues.apache.org/jira/browse/SOLR-2569
Project: Solr
Issue Type: Improvement
Components: multicore, replication (java)
Affects Versions: 4.0
Reporter: Jason Rutherglen

Spin-off from this thread: http://search-lucene.com/m/5CO7Z1oOrh6/elastic+search&subj=Solr+vs+ElasticSearch
[jira] [Commented] (LUCENE-3441) Add NRT support to LuceneTaxonomyReader
[ https://issues.apache.org/jira/browse/LUCENE-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109599#comment-13109599 ]

Jason Rutherglen commented on LUCENE-3441:

It would be great if the cost of (re)opening a new LTR is. Also an explanation of what it's doing underneath.

Add NRT support to LuceneTaxonomyReader

Key: LUCENE-3441
URL: https://issues.apache.org/jira/browse/LUCENE-3441
Project: Lucene - Java
Issue Type: New Feature
Components: modules/facet
Reporter: Shai Erera
Priority: Minor

Currently LuceneTaxonomyReader does not support NRT - i.e., on changes to LuceneTaxonomyWriter, you cannot have the reader updated, like IndexReader/Writer. In order to do that we need to do the following:

# Add ctor to LuceneTaxonomyReader to allow you to instantiate it with LuceneTaxonomyWriter.
# Add API to LuceneTaxonomyWriter to expose its internal IndexReader
# Change LTR.refresh() to return an LTR, rather than void. This is actually not strictly related to that issue, but since we'll need to modify refresh() impl, I think it'll be good to change its API as well. Since all of facet API is @lucene.experimental, no backwards issues here (and the sooner we do it, the better).
[jira] [Commented] (SOLR-2778) Revise distributed code inside SearchComponents
[ https://issues.apache.org/jira/browse/SOLR-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108316#comment-13108316 ]

Jason Rutherglen commented on SOLR-2778:

Sweet-ness.com!

Revise distributed code inside SearchComponents

Key: SOLR-2778
URL: https://issues.apache.org/jira/browse/SOLR-2778
Project: Solr
Issue Type: Improvement
Reporter: Martijn van Groningen

The distributed code inside search components such as QueryComponent and FacetComponent is complex. By structuring responsibilities the code becomes less complex and easier to understand. There is already a start for this that was part of distributed grouping (SOLR-2066). The following concepts were developed inside QueryComponent for SOLR-2066:

* ShardRequestFactory is responsible for creating requests to shards in the cluster based on the incoming request from the client.
* ShardResultTransformer transforms a NamedList response from the client into, for example, a SearchGroup or TopDocs instance.
* ShardResponseProcessor basically merges the shard responses. The ShardResponseProcessor uses a ShardResultTransformer to transform the shard response into a native structure (SearchGroup / TopGroups).

These concepts are now only used for distributed grouping, but I think they can also be used for non-grouped distributed search.
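To make the split of responsibilities described in SOLR-2778 more concrete, here is a minimal sketch of how the three roles could fit together. Only the names ShardRequestFactory and ShardResultTransformer come from the issue text. Every method signature below is an assumption made for illustration, plain maps and strings stand in for Solr's NamedList and request types, and `SumHitsResponseProcessor` is a hypothetical example of a response processor.

```java
import java.util.List;
import java.util.Map;

/** Creates one request per shard from the incoming client request (signature is an assumption). */
interface ShardRequestFactory {
    List<String> createRequests(String clientQuery, List<String> shards);
}

/** Converts a raw shard response into a native structure T (signature is an assumption). */
interface ShardResultTransformer<T> {
    T transform(Map<String, Object> shardResponse);
}

/** Hypothetical processor: merges shard responses by summing a per-shard hit count. */
final class SumHitsResponseProcessor {
    private final ShardResultTransformer<Integer> transformer;

    SumHitsResponseProcessor(ShardResultTransformer<Integer> transformer) {
        this.transformer = transformer;
    }

    /** Delegates per-shard conversion to the transformer, then merges the results. */
    int process(List<Map<String, Object>> shardResponses) {
        int total = 0;
        for (Map<String, Object> response : shardResponses) {
            total += transformer.transform(response);
        }
        return total;
    }
}
```

The point of the separation is that the merge logic never needs to know the wire format of a shard response; only the transformer does.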
[jira] [Commented] (SOLR-2066) Search Grouping: support distributed search
[ https://issues.apache.org/jira/browse/SOLR-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105474#comment-13105474 ]

Jason Rutherglen commented on SOLR-2066:

+1 on Concepts that can also be used for non grouped distributed searches in a separate issue. The Solr distributed search code is overly complicated.

Search Grouping: support distributed search

Key: SOLR-2066
URL: https://issues.apache.org/jira/browse/SOLR-2066
Project: Solr
Issue Type: Sub-task
Reporter: Yonik Seeley
Fix For: 3.5, 4.0
Attachments: SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch, SOLR-2066.patch

Support distributed field collapsing / search grouping.
[jira] [Commented] (LUCENE-3433) Random access non RAM resident IndexDocValues (CSF)
[ https://issues.apache.org/jira/browse/LUCENE-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104079#comment-13104079 ]

Jason Rutherglen commented on LUCENE-3433:

This is somewhat funny, as it seems the opinion has changed on MMap'ing and the potential for page faults: http://www.lucidimagination.com/search/document/8951a336dffa9535/storing_and_loading_the_fst_directly_from_disk#8951a336dffa9535

Random access non RAM resident IndexDocValues (CSF)

Key: LUCENE-3433
URL: https://issues.apache.org/jira/browse/LUCENE-3433
Project: Lucene - Java
Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Fix For: 4.0

There should be a way to get specific IndexDocValues by going through the Directory rather than loading all of the values into memory.
[jira] [Commented] (LUCENE-3433) Random access non RAM resident IndexDocValues (CSF)
[ https://issues.apache.org/jira/browse/LUCENE-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104165#comment-13104165 ]

Jason Rutherglen commented on LUCENE-3433:

Here's another thread discussing MMap'ing and field caches, where the consensus is against it: http://www.lucidimagination.com/search/document/70623ef5879bca38/fst_and_fieldcache#45006a7fe2847c09 posted in 1969-12-31 19:00 :)

Random access non RAM resident IndexDocValues (CSF)

Key: LUCENE-3433
URL: https://issues.apache.org/jira/browse/LUCENE-3433
Project: Lucene - Java
Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
Fix For: 4.0

There should be a way to get specific IndexDocValues by going through the Directory rather than loading all of the values into memory.
[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101391#comment-13101391 ]

Jason Rutherglen commented on LUCENE-2312:

There are many important use cases for immediate / zero delay index readers. I'm not sure if people realize it, but one of the major gains from this issue is the ability to obtain a reader after every indexed document.

In this case, instead of performing an array copy of the RT data structures, we will queue the changes, and then apply them to the new reader. For arrays like term freqs, we will use a temp hash map of the changes made since the main array was created (when the hash map grows too large we can perform a full array copy).

Search on IndexWriter's RAM Buffer

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: 4.0
Reporter: Jason Rutherglen
Assignee: Michael Busch
Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch, LUCENE-2312.patch

In order to offer users near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
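The idea in the comment above, a temporary hash map of changes layered over a main array with a full copy once the map grows too large, might be sketched as follows. This is only an illustration under stated assumptions: the class `OverlayIntArray`, its methods, and the `maxChanges` threshold are hypothetical names, the array has a fixed length, and the sketch is single-threaded, whereas the real RT structures would need to be safe for concurrent readers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a base array plus a map of changes made since the base was created.
final class OverlayIntArray {
    private int[] base;                                              // the "main array"
    private final Map<Integer, Integer> changes = new HashMap<>();   // index -> newer value
    private final int maxChanges;                                    // assumed threshold

    OverlayIntArray(int[] base, int maxChanges) {
        this.base = base.clone();
        this.maxChanges = maxChanges;
    }

    /** Queues a change; falls back to a full array copy when too many are pending. */
    void set(int index, int value) {
        changes.put(index, value);
        if (changes.size() > maxChanges) {
            compact();
        }
    }

    /** Reads the queued change if there is one, otherwise the base value. */
    int get(int index) {
        Integer changed = changes.get(index);
        return changed != null ? changed : base[index];
    }

    int pendingChanges() {
        return changes.size();
    }

    /** Full array copy: folds all queued changes into a fresh base array. */
    private void compact() {
        int[] copy = base.clone();
        for (Map.Entry<Integer, Integer> e : changes.entrySet()) {
            copy[e.getKey()] = e.getValue();
        }
        base = copy;
        changes.clear();
    }
}
```

The trade-off is that reads pay for a hash lookup while changes are pending, in exchange for avoiding an array copy per reader.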
[jira] [Commented] (SOLR-2700) transaction logging
[ https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099138#comment-13099138 ]

Jason Rutherglen commented on SOLR-2700:

I'm not sure how this feature makes any sense: the documents are already being serialized to disk, e.g., to the docstore by StoredFieldsWriter. Now the system will be serializing the exact same documents twice, which is extremely redundant.

transaction logging

Key: SOLR-2700
URL: https://issues.apache.org/jira/browse/SOLR-2700
Project: Solr
Issue Type: New Feature
Reporter: Yonik Seeley
Attachments: SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch

A transaction log is needed for durability of updates, for a more performant realtime-get, and for replaying updates to recovering peers.
[jira] [Commented] (SOLR-2700) transaction logging
[ https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099239#comment-13099239 ]

Jason Rutherglen commented on SOLR-2700:

This is going to be amazing. I wonder if other projects already implemented these features years ago?

transaction logging

Key: SOLR-2700
URL: https://issues.apache.org/jira/browse/SOLR-2700
Project: Solr
Issue Type: New Feature
Reporter: Yonik Seeley
Attachments: SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch

A transaction log is needed for durability of updates, for a more performant realtime-get, and for replaying updates to recovering peers.
[jira] [Commented] (SOLR-2748) autocommit commits too many times
[ https://issues.apache.org/jira/browse/SOLR-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099722#comment-13099722 ]

Jason Rutherglen commented on SOLR-2748:

Seeing all of the bugs related to the Solr NRT code, I can't help but wonder why the 4.x version of the project needs to be backward compatible. I also wonder why it's not using IndexReaderWarmer, which was ostensibly created precisely for Solr's usage (and it's not used in Solr and never has been).

autocommit commits too many times

Key: SOLR-2748
URL: https://issues.apache.org/jira/browse/SOLR-2748
Project: Solr
Issue Type: Bug
Reporter: Yonik Seeley
Attachments: SOLR-2748.patch

autocommit seems to commit more frequently than configured.
[jira] [Commented] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097246#comment-13097246 ]

Jason Rutherglen commented on LUCENE-3199:

I started integrating the patch into LUCENE-2312. I think the main functionality missing is a reverse int[] that points from a term id to the sorted ords array. That array would be used for implementing the RT version of DocTermsIndex, which maps doc id -> term id -> sorted term id index.

Add non-desctructive sort to BytesRefHash

Key: LUCENE-3199
URL: https://issues.apache.org/jira/browse/LUCENE-3199
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
Attachments: LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch

Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].
[jira] [Commented] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097257#comment-13097257 ]

Jason Rutherglen commented on LUCENE-3199:

Ok, solved the above comment by taking the sorted ord array and building a new reverse array from that...

Add non-desctructive sort to BytesRefHash

Key: LUCENE-3199
URL: https://issues.apache.org/jira/browse/LUCENE-3199
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
Attachments: LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch

Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].
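Building a reverse array from a sorted ord array, as described in the comment above, can be sketched in a few lines. The class and method names are hypothetical, and the sketch assumes term ids are dense (0 to n-1) with each id appearing exactly once in the sorted array. The patch itself may differ.

```java
// Hypothetical sketch: invert "sorted position -> term id" into "term id -> sorted position".
final class OrdReversal {

    /**
     * @param sortedOrds sortedOrds[pos] is the term id found at sorted position pos
     * @return reverse, where reverse[termId] is the sorted position of that term id
     */
    static int[] reverse(int[] sortedOrds) {
        int[] reverse = new int[sortedOrds.length];
        for (int pos = 0; pos < sortedOrds.length; pos++) {
            reverse[sortedOrds[pos]] = pos;
        }
        return reverse;
    }
}
```

Given a per-document term id, a lookup in the reverse array yields the term's sorted position, which is the doc id -> term id -> sorted term id chain a DocTermsIndex needs.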
[jira] [Updated] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-3199:

Attachment: LUCENE-3199.patch

This is a minor update compared with the last patch. It adds the option of pruning the [oversized] int[] returned by the compact method. Additional unit tests are included.

Add non-desctructive sort to BytesRefHash

Key: LUCENE-3199
URL: https://issues.apache.org/jira/browse/LUCENE-3199
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
Attachments: LUCENE-3199.patch, LUCENE-3199.patch

Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].
[jira] [Commented] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096108#comment-13096108 ]

Jason Rutherglen commented on LUCENE-3199:

Simon,

In summary this is using the BytesRefHash sort, performing array copies and then merge [sorting] into a new copy / view. Array copies are fast and, counterintuitively, generate far less garbage than objects (in Java). Instead of creating term 'segments' that would be merged while iterating the terms enum, we'll be generating static point-in-time terms dict views. These will be useful for enabling DocTermsIndex field caches for RT, the only remaining design 'challenge' for RT / LUCENE-2312. Because there is a terms hash, we can seek exact to the term rather than perform an [optimized] seek to the term.

Add non-desctructive sort to BytesRefHash

Key: LUCENE-3199
URL: https://issues.apache.org/jira/browse/LUCENE-3199
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
Attachments: LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch

Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].
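A non-destructive sort in the sense of this issue returns a separately allocated int[] of term ids in term order while leaving the underlying storage untouched. The sketch below shows the general shape only and is not BytesRefHash's implementation. `TermIdSorter` is a hypothetical name, a String[] indexed by term id stands in for the hash's byte storage, and String ordering is used in place of the byte-wise ordering Lucene would apply.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: sort term ids by term without reordering the terms themselves.
final class TermIdSorter {

    /** Returns a new array of term ids ordered by their terms; termsById is not modified. */
    static int[] sortedIds(String[] termsById) {
        Integer[] ids = new Integer[termsById.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Only the id array is reordered; the comparator reads terms by id.
        Arrays.sort(ids, Comparator.comparing((Integer id) -> termsById[id]));

        int[] sorted = new int[ids.length];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = ids[i];
        }
        return sorted;
    }
}
```

Because the result is a fresh array, it can serve as a static point-in-time view while indexing keeps adding terms to the underlying structure.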
[jira] [Commented] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096231#comment-13096231 ]

Jason Rutherglen commented on LUCENE-3199:

Simon, I think your patch should be in a different issue, eg, sorted bytes ref hash view or something.

Add non-desctructive sort to BytesRefHash

Key: LUCENE-3199
URL: https://issues.apache.org/jira/browse/LUCENE-3199
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
Attachments: LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch, LUCENE-3199.patch

Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].
[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095412#comment-13095412 ]

Jason Rutherglen commented on LUCENE-2312:

I'll post a new patch shortly that fixes bugs and adds a bit more to the functionality.

The benchmark results are interesting. Array copies are very fast and I don't see any problems with them; the median time is 2 ms. The concurrent skip list map is expensive when adding tens of thousands of terms. I think that is to be expected. The strategy of amortizing that cost by creating sorted by term int[]s will probably be more performant than CSLM. The sorted int[] terms can be merged just like segments, thus RT becomes a way to remove the [NRT] cost of merging [numerous] postings lists. The int[] terms can be merged in the background so that raw indexing speed is not affected.

Search on IndexWriter's RAM Buffer

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
Fix For: Realtime Branch
Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch, LUCENE-2312.patch

In order to offer users near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
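Merging sorted int[] term arrays "just like segments" amounts to a standard two-way merge keyed on the terms the ids refer to. The following is a minimal sketch under stated assumptions: `SortedTermMerger` is a hypothetical name, a String[] indexed by term id stands in for the term bytes, and both inputs are assumed to be already sorted by term and to contain disjoint ids.

```java
// Hypothetical sketch: merge two term-id arrays that are each sorted by term.
final class SortedTermMerger {

    /** Returns a new array holding the ids of a and b in term order; inputs are not modified. */
    static int[] merge(String[] termsById, int[] a, int[] b) {
        int[] merged = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            // Take whichever side currently has the smaller term.
            if (termsById[a[i]].compareTo(termsById[b[j]]) <= 0) {
                merged[k++] = a[i++];
            } else {
                merged[k++] = b[j++];
            }
        }
        while (i < a.length) {
            merged[k++] = a[i++];
        }
        while (j < b.length) {
            merged[k++] = b[j++];
        }
        return merged;
    }
}
```

Since the merge only reads its inputs and writes a new array, it could run on a background thread while indexing continues, which matches the comment's point about not affecting raw indexing speed.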
[jira] [Updated] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2312:

Fix Version/s: (was: Realtime Branch)
Affects Version/s: (was: Realtime Branch) 4.0

Search on IndexWriter's RAM Buffer

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: 4.0
Reporter: Jason Rutherglen
Assignee: Michael Busch
Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch, LUCENE-2312.patch

In order to offer users near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
[jira] [Updated] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-3199:

Attachment: LUCENE-3199.patch

Here's a version of this issue. Added are a couple of new methods to TestBytesRefHash to test the new frozen compact and sorting functionality of BytesRefHash. This is being posted now because it's useful in relation to LUCENE-2312 and a terms dictionary that is composed of sorted by term[id]s int[]s.

Add non-desctructive sort to BytesRefHash

Key: LUCENE-3199
URL: https://issues.apache.org/jira/browse/LUCENE-3199
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Priority: Minor
Attachments: LUCENE-3199.patch

Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[].
[jira] [Updated] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2312:

Attachment: LUCENE-2312.patch

Here's a new patch that incrementally adds field cache and norms values. Meaning that as documents are added / indexed, norms and field cache values are automatically created. The field cache values are only added to if they have already been created.

The field cache functionality needs to be completed for all types.

We probably need to get the indexing lock while the field cache value is initially being created (eg, the terms enumeration).

We're more or less feature complete now.

Search on IndexWriter's RAM Buffer

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
Fix For: Realtime Branch
Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch, LUCENE-2312.patch

In order to offer users near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
[jira] [Updated] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2312:
-
Attachment: LUCENE-2312.patch

This is a revised version of the LUCENE-2312 patch. The following are various and miscellaneous notes pertaining to the patch and where it needs to go to be committed. Feel free to review the approach taken, eg, we're getting around non-realtime structures through the usage of array copies (of which the arrays can be pooled at some point).

* A copy of FreqProxPostingsArray.termFreqs is made per new reader. That array can be pooled. This is no different than the deleted docs BitVector, which is created anew per-segment for any deletes that have occurred.
* The FreqProxPostingsArray arrays freqUptosRT, proxUptosRT, lastDocIDsRT, and lastDocFreqsRT are copied into per new reader (as opposed to an entirely new array being instantiated for each new reader); this is a slight optimization in object allocation.
* For deleting, a DWPT is wrapped in an abstract class that exposes the necessary methods from segment info, so that deletes may be applied to the RT RAM reader. The deleting is still performed in BufferedDeletesStream. BitVectors are cloned as well. There is room for improvement, eg, pooling the BV byte[]'s.
* Documents (FieldsWriter) and term vectors are flushed on each get reader call, so that reading will be able to load the data. We will need to test if this is performant. We are not creating new files, so this way of doing things may well be efficient.
* We need to measure the cost of the native system array copy. It could very well be fast enough.
* Full posting functionality should be working, including payloads.
* Field caching may be implemented as a new field cache that is growable and enables locked replacement of the underlying array.
* String-to-string-ordinal comparison caches need to be figured out. The RAM readers cannot maintain a sorted terms index the way statically sized segments do.
* When a field cache value is first being created, it needs to obtain the indexing lock on the DWPT. Otherwise documents will continue to be indexed and new values created, while the array will miss the new values. The downside is that while the array is initially being created, indexing will stop. This can probably be solved at some point by only locking during the creation of the field cache array, and then notifying the DWPT of the new array. New values would then accumulate into the array from the point of the max doc of the reader the values creator is working from.
* The terms dictionary is a ConcurrentSkipListMap. We can periodically convert it into a sorted [by term] int[] that has an FST on top.

Have fun reviewing! :)

Search on IndexWriter's RAM Buffer
--
Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
Fix For: Realtime Branch
Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch

In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
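The array-copy approach in the notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch: the class TermFreqSnapshots and its methods are hypothetical names, a plain int[] stands in for FreqProxPostingsArray.termFreqs, and a coarse lock stands in for whatever synchronization the DWPT really uses.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** Hypothetical sketch: per-reader snapshots of a live int[] with a simple array pool. */
final class TermFreqSnapshots {
    private int[] termFreqs = new int[8];  // live array, mutated by the indexing side
    private int numTerms = 0;              // number of valid entries in termFreqs
    private final ArrayDeque<int[]> pool = new ArrayDeque<>();

    /** Indexing side: record one more occurrence of the given term id. */
    synchronized void increment(int termId) {
        if (termId >= termFreqs.length) {
            termFreqs = Arrays.copyOf(termFreqs, Math.max(termId + 1, termFreqs.length * 2));
        }
        termFreqs[termId]++;
        numTerms = Math.max(numTerms, termId + 1);
    }

    /** Reader side: copy the live values into a pooled array, or a new one if none fits. */
    synchronized int[] snapshot() {
        int[] copy = pool.poll();
        if (copy == null || copy.length != numTerms) {
            copy = new int[numTerms];      // a pooled array of the wrong size is simply dropped
        }
        System.arraycopy(termFreqs, 0, copy, 0, numTerms);
        return copy;
    }

    /** Called when a reader is closed so that its array can be reused. */
    synchronized void release(int[] snapshot) {
        pool.offer(snapshot);
    }
}
```

A reader that holds a snapshot never sees later increments, which is the point-in-time behaviour the notes rely on; whether the copy is cheap enough on large buffers is exactly what still has to be measured.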
[jira] [Commented] (SOLR-2700) transaction logging
[ https://issues.apache.org/jira/browse/SOLR-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090722#comment-13090722 ] Jason Rutherglen commented on SOLR-2700: Typically a transaction log is configured to be written to a different hard drive than the indexes / database. transaction logging --- Key: SOLR-2700 URL: https://issues.apache.org/jira/browse/SOLR-2700 Project: Solr Issue Type: New Feature Reporter: Yonik Seeley Attachments: SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch, SOLR-2700.patch A transaction log is needed for durability of updates, for a more performant realtime-get, and for replaying updates to recovering peers.
[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090764#comment-13090764 ] Jason Rutherglen commented on LUCENE-2312: -- A benchmark plan: compare the speed of NRT vs. RT. Index documents in a single thread, and in a second thread open a reader and perform a query. It would be nice to synchronize the point / max doc at which RT and NRT open new readers, to additionally verify the correctness of the directly comparable search results. To make the test fair, the concurrent merge scheduler should be turned off in the NRT test. The hypothesis is that array copying, even on large [RT] indexes, is no big deal compared with the excessive segment merging of NRT. Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: core/search Affects Versions: Realtime Branch Reporter: Jason Rutherglen Assignee: Michael Busch Fix For: Realtime Branch Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch, LUCENE-2312.patch In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
[jira] [Created] (LUCENE-3399) Enable replace-able field caches
Enable replace-able field caches Key: LUCENE-3399 URL: https://issues.apache.org/jira/browse/LUCENE-3399 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor For LUCENE-2312 we need to be able to synchronously replace field cache values and receive events on when new field cache values are created. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3399) Enable replace-able field caches
[ https://issues.apache.org/jira/browse/LUCENE-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3399: - Attachment: LUCENE-3399.patch A cut of this. Enable replace-able field caches Key: LUCENE-3399 URL: https://issues.apache.org/jira/browse/LUCENE-3399 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-3399.patch For LUCENE-2312 we need to be able to synchronously replace field cache values and receive events on when new field cache values are created. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
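One way to picture the two requirements in this issue (synchronous replacement of field cache values, plus an event when values are first created) is the sketch below. It is an assumption-laden illustration, not the attached patch: ReplaceableIntFieldCache and CreationListener are hypothetical names, and a map of int[] per field stands in for Lucene's real field cache entries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of a field cache whose values can be swapped under a lock. */
final class ReplaceableIntFieldCache {
    /** Callback fired the first time values are installed for a field. */
    interface CreationListener {
        void onCreated(String field, int[] values);
    }

    private final Map<String, int[]> cache = new HashMap<>();
    private final List<CreationListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(CreationListener listener) {
        listeners.add(listener);
    }

    synchronized int[] get(String field) {
        return cache.get(field);
    }

    /**
     * Installs newValues for the field and returns the previous array (or null).
     * The swap happens under the lock, so a reader sees either the old or the new array.
     */
    int[] replace(String field, int[] newValues) {
        int[] previous;
        synchronized (this) {
            previous = cache.put(field, newValues);
        }
        if (previous == null) {
            // Notify outside the lock so a slow listener cannot block readers.
            for (CreationListener listener : listeners) {
                listener.onCreated(field, newValues);
            }
        }
        return previous;
    }
}
```

In the LUCENE-2312 setting the listener would presumably be the DWPT, which could then start appending values for newly indexed documents into the installed array.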
[jira] [Commented] (SOLR-2702) Add support for NRTCachingDirectory
[ https://issues.apache.org/jira/browse/SOLR-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087438#comment-13087438 ] Jason Rutherglen commented on SOLR-2702: Can we mark this for Lucene 3.x as well? Add support for NRTCachingDirectory --- Key: SOLR-2702 URL: https://issues.apache.org/jira/browse/SOLR-2702 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Would be nice to have this option for the new NRT support.
[jira] [Commented] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13086754#comment-13086754 ] Jason Rutherglen commented on SOLR-1431: Can we look at backporting this one to 3.x, given 4.x is a little ways off? CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Fix For: 4.0 Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2565) Prevent IW#close and cut over to IW#commit
[ https://issues.apache.org/jira/browse/SOLR-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086755#comment-13086755 ] Jason Rutherglen commented on SOLR-2565: Can this one be backported to 3.x? It would probably be fairly useful for people now. Prevent IW#close and cut over to IW#commit -- Key: SOLR-2565 URL: https://issues.apache.org/jira/browse/SOLR-2565 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2565.patch, SOLR-2565.patch, SOLR-2565.patch Spin-off from SOLR-2193. We already have a branch to work on this issue here https://svn.apache.org/repos/asf/lucene/dev/branches/solr2193 The main goal here is to prevent Solr from closing the IW and use IW#commit instead. AFAIK the main issues here are: The update handler needs an overhaul. A few goals I think we might want to look at: 1. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: 2. Stop closing the IndexWriter and start using commit (still lazy IW init though). 3. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 4. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. Eventually this is a preparation for NRT support in Solr which I will create a followup issue for.
[jira] [Commented] (SOLR-2565) Prevent IW#close and cut over to IW#commit
[ https://issues.apache.org/jira/browse/SOLR-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13075980#comment-13075980 ] Jason Rutherglen commented on SOLR-2565: The comments on this issue say it was committed, however its status is Unresolved? Prevent IW#close and cut over to IW#commit -- Key: SOLR-2565 URL: https://issues.apache.org/jira/browse/SOLR-2565 Project: Solr Issue Type: Improvement Components: update Affects Versions: 4.0 Reporter: Simon Willnauer Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2565.patch Spin-off from SOLR-2193. We already have a branch to work on this issue here https://svn.apache.org/repos/asf/lucene/dev/branches/solr2193 The main goal here is to prevent Solr from closing the IW and use IW#commit instead. AFAIK the main issues here are: The update handler needs an overhaul. A few goals I think we might want to look at: 1. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: 2. Stop closing the IndexWriter and start using commit (still lazy IW init though). 3. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 4. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. Eventually this is a preparation for NRT support in Solr which I will create a followup issue for.
[jira] [Commented] (LUCENE-3348) IndexWriter applies wrong deletes during concurrent flush-all
[ https://issues.apache.org/jira/browse/LUCENE-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13072962#comment-13072962 ] Jason Rutherglen commented on LUCENE-3348: -- Sorry to add my opinion to this, however I think that while non-blocking deletes are quite fancy, it seems they are open to various bugs such as this. Is there a compelling reason non-locking is used, eg, performance? IndexWriter applies wrong deletes during concurrent flush-all - Key: LUCENE-3348 URL: https://issues.apache.org/jira/browse/LUCENE-3348 Project: Lucene - Java Issue Type: Bug Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.4, 4.0 Attachments: LUCENE-3348.patch Yonik uncovered this with the TestRealTimeGet test: if a flush-all is underway, it is possible for an incoming update to pick a DWPT that is stale, ie, not yet pulled/marked for flushing, yet the DW has cutover to a new deletes queue. If this happens, and the deleted term was also updated in one of the non-stale DWPTs, then the wrong document is deleted and the test fails by detecting the wrong value. There's a 2nd failure mode that I haven't figured out yet, whereby 2 docs are returned when searching by id (there should only ever be 1 doc since the test uses updateDocument which is atomic wrt commit/reopen). Yonik verified the test passes pre-DWPT, so my guess is (but I have yet to verify) this test also passes on 3.x. I'll backport the test to 3.x to be sure. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13064751#comment-13064751 ] Jason Rutherglen commented on LUCENE-2312: -- I had been testing out an alternative skip list. I think it's a bit too esoteric at this point. I'm resuming work on this issue, using Java's CSLM for the terms dict. There really isn't a good way to break up the patch, it's just going to be large, eg, we can't separate out the terms dict from the RT postings etc. Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: core/search Affects Versions: Realtime Branch Reporter: Jason Rutherglen Assignee: Michael Busch Fix For: Realtime Branch Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch In order to offer user's near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Todays Lucene based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. 
[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062862#comment-13062862 ] Jason Rutherglen commented on LUCENE-3296: -- Uwe, the first patch [1] is implemented with CURRENT. 1. https://issues.apache.org/jira/secure/attachment/12485805/LUCENE-3296.patch Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 3.3, 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch, LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3296: - Attachment: LUCENE-3296.patch This patch uses LUCENE_40. All tests pass. Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch, LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062349#comment-13062349 ] Jason Rutherglen commented on LUCENE-3296: -- That was in there previously. Let's change it. Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Simon Willnauer Priority: Trivial Attachments: LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter.
[jira] [Commented] (LUCENE-2919) IndexSplitter that divides by primary key term
[ https://issues.apache.org/jira/browse/LUCENE-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062281#comment-13062281 ] Jason Rutherglen commented on LUCENE-2919: -- Sorry for the naive off/on-topic question. Ryan, what's the repository info that needs to be added to the pom.xml so that the project downloads the 4.0 snapshot? Eg, I don't think it's:
{code}
<repository>
  <id>lucene</id>
  <url>https://builds.apache.org/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/org/apache/</url>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>
{code}
IndexSplitter that divides by primary key term -- Key: LUCENE-2919 URL: https://issues.apache.org/jira/browse/LUCENE-2919 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Uwe Schindler Priority: Minor Fix For: 3.3, 4.0 Attachments: LUCENE-2919-3x.patch, LUCENE-2919-filter.patch, LUCENE-2919-filter.patch, LUCENE-2919-filter.patch, LUCENE-2919.patch Index splitter that divides by primary key term. The contrib MultiPassIndexSplitter we have divides by docid, however to guarantee external constraints it's sometimes necessary to split by a primary key term id. I think this implementation is a fairly trivial change.
[jira] [Created] (LUCENE-3296) Enable passing a config into PKIndexSplitter
Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Trivial I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3296) Enable passing a config into PKIndexSplitter
[ https://issues.apache.org/jira/browse/LUCENE-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3296: - Attachment: LUCENE-3296.patch Patch, all tests pass. Enable passing a config into PKIndexSplitter Key: LUCENE-3296 URL: https://issues.apache.org/jira/browse/LUCENE-3296 Project: Lucene - Java Issue Type: Improvement Components: core/other Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Trivial Attachments: LUCENE-3296.patch I need to be able to pass the IndexWriterConfig into the IW used by PKIndexSplitter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3245) Realtime terms dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3245: - Attachment: LUCENE-3245.patch Here's a cut with a first implementation of the CSLM and AIA terms dictionaries. I think we're ready to benchmark writes. Realtime terms dictionary - Key: LUCENE-3245 URL: https://issues.apache.org/jira/browse/LUCENE-3245 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-3245.patch, LUCENE-3245.patch, LUCENE-3245.patch For LUCENE-2312 we need a realtime terms dictionary. While ConcurrentSkipListMap may be used, it has drawbacks in terms of high object overhead which can impact GC collection times and heap memory usage. If we implement a skip list that uses primitive backing arrays, we can hopefully have a data structure that is [as] fast and memory efficient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3245) Realtime terms dictionary
Realtime terms dictionary - Key: LUCENE-3245 URL: https://issues.apache.org/jira/browse/LUCENE-3245 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor For LUCENE-2312 we need a realtime terms dictionary. While ConcurrentSkipListMap may be used, it has drawbacks in terms of high object overhead which can impact GC collection times and heap memory usage. If we implement a skip list that uses primitive backing arrays, we can hopefully have a data structure that is [as] fast and memory efficient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
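For orientation, the ConcurrentSkipListMap option mentioned in the description could look something like the sketch below. It is a hypothetical illustration only: CslmTermsDict is an invented name, String terms and boxed Integer ids stand in for the byte-based terms and postings pointers Lucene would really use, and a single writer per dictionary is assumed (as with a DWPT).

```java
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical sketch: a sorted, concurrently readable terms dictionary on top of CSLM. */
final class CslmTermsDict {
    private final ConcurrentSkipListMap<String, Integer> terms = new ConcurrentSkipListMap<>();
    private int nextTermId = 0; // only touched by the single writer thread

    /** Writer side: returns the id of the term, assigning a new id if it is unseen. */
    int add(String term) {
        Integer existing = terms.get(term);
        if (existing != null) {
            return existing;
        }
        int id = nextTermId++;
        terms.put(term, id);
        return id;
    }

    /** Reader side: the id of the term, or null if it has not been indexed. */
    Integer termId(String term) {
        return terms.get(term);
    }

    /** Reader side: the smallest term >= target, or null (similar to seeking a terms enum). */
    String seekCeil(String target) {
        return terms.ceilingKey(target);
    }

    /** Reader side: all terms >= start, in sorted order. */
    Iterable<String> termsFrom(String start) {
        return terms.tailMap(start, true).keySet();
    }
}
```

Every entry here costs a map node plus a boxed Integer (and index nodes on top), which is the per-term object overhead the description is worried about and the motivation for trying primitive backing arrays instead.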
[jira] [Updated] (LUCENE-3245) Realtime terms dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3245: - Attachment: LUCENE-3245.patch Here's a basic initial patch implementing a single threaded writer, multiple reader atomic integer array skip list. The next step is to tie in the ByteBlockPool to store terms, eg, implement an RTTermsDictAIA class, and an RTTermsDictCSLM class. We can then load the same Wiki-EN terms, and measure the comparative write speeds. Then create a set of terms to lookup from each terms dict and measure the time difference. I am not yet sure how the speed of AtomicIntegerArray will compare with CSLM's usage of AtomicReferenceFieldUpdater. Of note is the fact that because of DWPTs we do not need a skip list that supports concurrent writes. And because we're only adding new unique terms, we do not need delete functionality. Ie, AIA could be faster, though we may need to inline code and perform various tuning tricks. Realtime terms dictionary - Key: LUCENE-3245 URL: https://issues.apache.org/jira/browse/LUCENE-3245 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-3245.patch For LUCENE-2312 we need a realtime terms dictionary. While ConcurrentSkipListMap may be used, it has drawbacks in terms of high object overhead which can impact GC collection times and heap memory usage. If we implement a skip list that uses primitive backing arrays, we can hopefully have a data structure that is [as] fast and memory efficient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
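The single-writer, multiple-reader idea in this comment can be illustrated with a much-simplified skip list over an AtomicIntegerArray. This is a sketch under loud assumptions and not the attached patch: IntSkipListAIA is a hypothetical class, plain int keys stand in for terms stored in a ByteBlockPool, the capacity is fixed, and there is no delete support (which the comment says is not needed).

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical sketch: add-only skip list with one writer thread and many reader threads. */
final class IntSkipListAIA {
    private static final int MAX_LEVEL = 8;
    private static final int NIL = 0;       // node 0 is the head, so 0 also means "no next node"
    private final int[] keys;               // keys[n] is written before node n becomes reachable
    private final AtomicIntegerArray next;  // next[n * MAX_LEVEL + level] = successor of n at level
    private final Random random = new Random(42);
    private int size = 1;                   // next free node slot; writer thread only

    IntSkipListAIA(int capacity) {
        keys = new int[capacity + 1];
        next = new AtomicIntegerArray((capacity + 1) * MAX_LEVEL);
    }

    /** Writer thread only. Returns false if the key is already present. */
    boolean add(int key) {
        int[] preds = new int[MAX_LEVEL];
        int x = 0;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            int n;
            while ((n = next.get(x * MAX_LEVEL + level)) != NIL && keys[n] < key) {
                x = n;
            }
            preds[level] = x;
        }
        int candidate = next.get(x * MAX_LEVEL);
        if (candidate != NIL && keys[candidate] == key) {
            return false;
        }
        if (size == keys.length) {
            throw new IllegalStateException("skip list is full");
        }
        int node = size++;
        keys[node] = key;
        int height = 1;
        while (height < MAX_LEVEL && random.nextBoolean()) {
            height++;
        }
        // First point the new node at its successors, then publish it bottom-up.
        // The volatile set() calls make keys[node] visible to any reader that reaches the node.
        for (int level = 0; level < height; level++) {
            next.set(node * MAX_LEVEL + level, next.get(preds[level] * MAX_LEVEL + level));
        }
        for (int level = 0; level < height; level++) {
            next.set(preds[level] * MAX_LEVEL + level, node);
        }
        return true;
    }

    /** Safe to call from any thread while the writer is adding. */
    boolean contains(int key) {
        int x = 0;
        for (int level = MAX_LEVEL - 1; level >= 0; level--) {
            int n;
            while ((n = next.get(x * MAX_LEVEL + level)) != NIL && keys[n] < key) {
                x = n;
            }
        }
        int candidate = next.get(x * MAX_LEVEL);
        return candidate != NIL && keys[candidate] == key;
    }

    /** Walks the level-zero linked list and returns the keys in sorted order. */
    int[] toSortedArray() {
        int[] out = new int[keys.length - 1];
        int count = 0;
        for (int n = next.get(0); n != NIL; n = next.get(n * MAX_LEVEL)) {
            out[count++] = keys[n];
        }
        return Arrays.copyOf(out, count);
    }
}
```

Because only one thread ever writes, no compare-and-set loops are needed; the writer just has to publish a fully initialized node with volatile stores. Whether this beats CSLM in practice is the open benchmarking question raised in the comment.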
[jira] [Updated] (LUCENE-3245) Realtime terms dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-3245: - Attachment: LUCENE-3245.patch Added and fixed the code that traverses the skip list to the level zero linked list and iterates. I need to reuse the starts int array, that's next. Realtime terms dictionary - Key: LUCENE-3245 URL: https://issues.apache.org/jira/browse/LUCENE-3245 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-3245.patch, LUCENE-3245.patch For LUCENE-2312 we need a realtime terms dictionary. While ConcurrentSkipListMap may be used, it has drawbacks in terms of high object overhead which can impact GC collection times and heap memory usage. If we implement a skip list that uses primitive backing arrays, we can hopefully have a data structure that is [as] fast and memory efficient. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053992#comment-13053992 ] Jason Rutherglen commented on SOLR-2610: Mark put it aptly. The problem I think I encountered in my own version is that leftover file handles seemed to prevent the deletion of all the files; many times some of them would be left behind. Also, I deleted the entire core directory, which is useful for manual testing (eg, to avoid the directory exists exception). Add an option to delete index through CoreAdmin UNLOAD action - Key: SOLR-2610 URL: https://issues.apache.org/jira/browse/SOLR-2610 Project: Solr Issue Type: Improvement Components: multicore Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2610-branch3x.patch, SOLR-2610.patch Right now, one can unload a Solr Core but the index files are left behind and consume disk space. We should have an option to delete the index when unloading a core.
[jira] [Commented] (LUCENE-3079) Facetiing module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053581#comment-13053581 ] Jason Rutherglen commented on LUCENE-3079: -- Schemas should probably be a module that makes use of embedding the field types per-segment; this is something the faceting module could/should use. I think this is what LUCENE-2308 is aiming for? Though I thought there was another Jira issue created by Simon for this as well. Facetiing module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this, by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (eg Bobo browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big the reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed to rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... (and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of knowledge solr has because it owns the index that we no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs and so we really have to address them in creating the shared module.
I think we should get a basic faceting module started, but should not cut Solr over at first. We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (ie lose functionality or performance) we can later cutover. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3079) Facetiing module
[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13053591#comment-13053591 ] Jason Rutherglen commented on LUCENE-3079: -- bq. I don't think any Facet module needs to be concerned with Schemas Right, they should be field type aware. Facetiing module Key: LUCENE-3079 URL: https://issues.apache.org/jira/browse/LUCENE-3079 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Attachments: LUCENE-3079.patch Faceting is a hugely important feature, available in Solr today but not [easily] usable by Lucene-only apps. We should fix this, by creating a shared faceting module. Ideally, we factor out Solr's faceting impl, and maybe poach/merge from other impls (eg Bobo browse). Hoss describes some important challenges we'll face in doing this (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here: {noformat} To look at faceting as a concrete example, there are big the reasons faceting works so well in Solr: Solr has total control over the index, knows exactly when the index has changed to rebuild caches, has a strict schema so it can make sense of field types and pick faceting algos accordingly, has multi-phase distributed search approach to get exact counts efficiently across multiple shards, etc... (and there are still a lot of additional enhancements and improvements that can be made to take even more advantage of knowledge solr has because it owns the index that we no one has had time to tackle) {noformat} This is a great list of the things we face in refactoring. It's also important because, if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same needs and so we really have to address them in creating the shared module. I think we should get a basic faceting module started, but should not cut Solr over at first. 
We should iterate on the module, fold in improvements, etc., and then, once we can fully verify that cutting over doesn't hurt Solr (ie lose functionality or performance) we can later cutover. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2610) Add an option to delete index through CoreAdmin UNLOAD action
[ https://issues.apache.org/jira/browse/SOLR-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13052624#comment-13052624 ] Jason Rutherglen commented on SOLR-2610: This is good! I had to write the same functionality into a custom Solr build on a project. Add an option to delete index through CoreAdmin UNLOAD action - Key: SOLR-2610 URL: https://issues.apache.org/jira/browse/SOLR-2610 Project: Solr Issue Type: Improvement Components: multicore Reporter: Shalin Shekhar Mangar Assignee: Shalin Shekhar Mangar Priority: Minor Fix For: 3.3, 4.0 Attachments: SOLR-2610.patch Right now, one can unload a Solr Core but the index files are left behind and consume disk space. We should have an option to delete the index when unloading a core. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2919) IndexSplitter that divides by primary key term
[ https://issues.apache.org/jira/browse/LUCENE-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13051552#comment-13051552 ] Jason Rutherglen commented on LUCENE-2919: -- Thanks, committing this means I can remove a custom GitHub branch with only this patch. Also, it'd be great if we somehow published nightly versions to Maven repositories. Though they'd accumulate over time. IndexSplitter that divides by primary key term -- Key: LUCENE-2919 URL: https://issues.apache.org/jira/browse/LUCENE-2919 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Uwe Schindler Priority: Minor Fix For: 3.3, 4.0 Attachments: LUCENE-2919-3x.patch, LUCENE-2919-filter.patch, LUCENE-2919-filter.patch, LUCENE-2919-filter.patch, LUCENE-2919.patch Index splitter that divides by primary key term. The contrib MultiPassIndexSplitter we have divides by docid, however to guarantee external constraints it's sometimes necessary to split by a primary key term id. I think this implementation is a fairly trivial change. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
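The contrast drawn in the issue description, splitting by primary key term rather than by docid, can be illustrated roughly as follows. This is a minimal sketch and not the code in the attached patches: the class `KeySplitSketch`, its methods, and the choice of a single pivot key that divides the key space into two ranges are all assumptions made for illustration. The point is only that the target partition is derived from the key itself, so the same key always lands in the same partition no matter which docid it happens to have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
public class KeySplitSketch {

    /** Returns 0 if the key sorts before the pivot key, otherwise 1. */
    static int partitionFor(String primaryKey, String pivotKey) {
        return primaryKey.compareTo(pivotKey) < 0 ? 0 : 1;
    }

    /** Splits the keys into two lists around the pivot, preserving input order. */
    static List<List<String>> split(List<String> keys, String pivotKey) {
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<String>());
        parts.add(new ArrayList<String>());
        for (String key : keys) {
            parts.get(partitionFor(key, pivotKey)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        // The input order stands in for docid order; it does not affect placement.
        List<String> keys = Arrays.asList("user-17", "user-03", "user-42", "user-08");
        System.out.println(split(keys, "user-10"));
    }
}
```

A docid-based split would instead cut the list at a position, so re-indexing the same documents in a different order could move a key to a different partition.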
[jira] [Commented] (LUCENE-2919) IndexSplitter that divides by primary key term
[ https://issues.apache.org/jira/browse/LUCENE-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13051576#comment-13051576 ] Jason Rutherglen commented on LUCENE-2919: -- @Ryan Thanks! What artifact info would one put into the pom.xml? IndexSplitter that divides by primary key term -- Key: LUCENE-2919 URL: https://issues.apache.org/jira/browse/LUCENE-2919 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Uwe Schindler Priority: Minor Fix For: 3.3, 4.0 Attachments: LUCENE-2919-3x.patch, LUCENE-2919-filter.patch, LUCENE-2919-filter.patch, LUCENE-2919-filter.patch, LUCENE-2919.patch Index splitter that divides by primary key term. The contrib MultiPassIndexSplitter we have divides by docid, however to guarantee external constraints it's sometimes necessary to split by a primary key term id. I think this implementation is a fairly trivial change. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050521#comment-13050521 ] Jason Rutherglen commented on SOLR-1431: Seems to be fine. It'd be great to modularize Zookeeper references into a separate abstract interface (like what's done here), and not tie it to CoreContainer. I think it could conflict with other uses of Zookeeper when the library versions are different. CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Fix For: 4.0 Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050613#comment-13050613 ] Jason Rutherglen commented on SOLR-1431: @Noble I agree, I don't think committing this patch should hold things up. That was just a little note. I've been looking at implementing Solr into HBase and am worried [somewhat] about the ZK libraries. HBase + Solr can help with the massive scale near realtime systems you've described, eg, HBase implements splitting, partitioning, a fast write ahead log, etc. Facebook has implemented the index directly into HBase, which probably offers degraded indexing and search performance. bq. We badly need the cloud features now Right, many users are going with Elastic Search for the reasons mentioned. CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Fix For: 4.0 Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050638#comment-13050638 ] Jason Rutherglen commented on SOLR-1431: Noble, the Jira issue is HBASE-3529 where much of the code is offline on Git because of the different pieces involved. That being said, I've linked the various Lucene and Solr Jira issues that are required to implement Solr in HBase, eg LUCENE-2919 and SOLR-2563. CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Fix For: 4.0 Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
[ https://issues.apache.org/jira/browse/LUCENE-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13048987#comment-13048987 ] Jason Rutherglen commented on LUCENE-3199: -- I think the issue with this, as it relates to realtime search, is in order to sort, we'll need to freeze indexing. Add non-desctructive sort to BytesRefHash - Key: LUCENE-3199 URL: https://issues.apache.org/jira/browse/LUCENE-3199 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[]. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-3199) Add non-desctructive sort to BytesRefHash
Add non-desctructive sort to BytesRefHash - Key: LUCENE-3199 URL: https://issues.apache.org/jira/browse/LUCENE-3199 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Currently the BytesRefHash is destructive. We can add a method that returns a non-destructively generated int[]. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
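One way to read "a method that returns a non-destructively generated int[]" is: sort a copy of the ids and leave the underlying storage untouched, so the structure can still be used for lookups and inserts afterwards. The sketch below illustrates only that idea. It does not use Lucene's `BytesRefHash`; the class `NonDestructiveSortSketch`, the plain `byte[][]` standing in for the hash's storage, and the unsigned byte comparison are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; this is not Lucene's BytesRefHash.
public class NonDestructiveSortSketch {

    /** Compares two byte arrays as unsigned bytes; a shorter prefix sorts first. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns the ids ordered by their values; valuesById itself is not modified. */
    static int[] sortedIds(final byte[][] valuesById) {
        Integer[] ids = new Integer[valuesById.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Only the freshly allocated id array is reordered.
        Arrays.sort(ids, new Comparator<Integer>() {
            @Override
            public int compare(Integer x, Integer y) {
                return compareBytes(valuesById[x], valuesById[y]);
            }
        });
        int[] out = new int[ids.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = ids[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] values = {
            "pear".getBytes(StandardCharsets.UTF_8),   // id 0
            "apple".getBytes(StandardCharsets.UTF_8),  // id 1
            "fig".getBytes(StandardCharsets.UTF_8)     // id 2
        };
        System.out.println(Arrays.toString(sortedIds(values)));
    }
}
```

The cost of this approach is the extra id array per sort call, which is the trade-off against an in-place sort that reuses the existing storage.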
[jira] [Commented] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13047275#comment-13047275 ] Jason Rutherglen commented on SOLR-1431: I just downloaded http://svn.apache.org/repos/asf/lucene/dev/trunk and applied the patch, and test-core passed. However the patch command mentioned specific hunks, though there was no .rej file. CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2955) Add utitily class to manage NRT reopening
[ https://issues.apache.org/jira/browse/LUCENE-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046644#comment-13046644 ] Jason Rutherglen commented on LUCENE-2955: -- Perhaps we can merge this functionality with SOLR-2565 and/or SOLR-2566, such that Solr utilizes it for reader opening. However, why would this issue use a background thread while Solr performs a max-time reopen? Add utitily class to manage NRT reopening - Key: LUCENE-2955 URL: https://issues.apache.org/jira/browse/LUCENE-2955 Project: Lucene - Java Issue Type: Improvement Components: core/index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.3 Attachments: LUCENE-2955.patch, LUCENE-2955.patch I created a simple class, NRTManager, that tries to abstract away some of the reopen logic when using NRT readers. You give it your IW, tell it min and max nanoseconds staleness you can tolerate, and it privately runs a reopen thread to periodically reopen the searcher. It subsumes the SearcherManager from LIA2. Besides running the reopen thread, it also adds the notion of a generation containing changes you've made. So eg it has addDocument, returning a long. You can then take that long value and pass it back to the getSearcher method and getSearcher will return a searcher that reflects the changes made in that generation. This gives your app the freedom to force immediate consistency (ie wait for the reopen) only for those searches that require it, like a verifier that adds a doc and then immediately searches for it, but also use eventual consistency for other searches. I want to also add support for the new applyDeletions option when pulling an NRT reader. Also, this is very new and I'm sure buggy -- the concurrency is either wrong or overly-locking. But it's a start...
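The generation idea in the LUCENE-2955 description (addDocument returns a long, and getSearcher can wait until a searcher reflects that generation) might be sketched like this. It is a toy model and not the NRTManager in the attached patch: the class `GenerationReopenSketch`, the in-memory list standing in for the index, and the list snapshot standing in for a searcher are all assumptions made for illustration. The reopen is triggered here by a one-shot background thread, whereas the description mentions a thread that reopens periodically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's NRTManager.
public class GenerationReopenSketch {
    private final List<String> written = new ArrayList<>(); // every document added so far
    private long writeGen = 0;   // generation of the most recent write
    private long searchGen = 0;  // newest generation visible in the snapshot
    private List<String> snapshot = Collections.emptyList();

    /** Records a document and returns the generation that contains it. */
    public synchronized long addDocument(String doc) {
        written.add(doc);
        return ++writeGen;
    }

    /** Publishes a snapshot covering every write so far and wakes any waiters. */
    public synchronized void reopen() {
        snapshot = Collections.unmodifiableList(new ArrayList<>(written));
        searchGen = writeGen;
        notifyAll();
    }

    /** Returns the current snapshot without waiting (eventual consistency). */
    public synchronized List<String> getSearcher() {
        return snapshot;
    }

    /** Waits until the snapshot covers minGen, then returns it (immediate consistency). */
    public synchronized List<String> getSearcher(long minGen) {
        try {
            while (searchGen < minGen) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for generation " + minGen, e);
        }
        return snapshot;
    }

    public static void main(String[] args) {
        final GenerationReopenSketch mgr = new GenerationReopenSketch();
        long gen = mgr.addDocument("doc-1");

        // Stand-in for the reopen thread: reopens once after a short delay.
        Thread reopener = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            mgr.reopen();
        });
        reopener.setDaemon(true);
        reopener.start();

        // Blocks until the reopen above has made generation `gen` visible.
        List<String> fresh = mgr.getSearcher(gen);
        System.out.println("visible at generation " + gen + ": " + fresh);
    }
}
```

A caller that can tolerate staleness uses the no-argument `getSearcher()`, while a verifier that must see its own write passes the generation it got back from `addDocument`.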
[jira] [Commented] (LUCENE-3176) TestNRTThreads test failure
[ https://issues.apache.org/jira/browse/LUCENE-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044981#comment-13044981 ] Jason Rutherglen commented on LUCENE-3176: -- It's probably the new DWPT code. There was a specific issue to fix this problem LUCENE-2956. TestNRTThreads test failure --- Key: LUCENE-3176 URL: https://issues.apache.org/jira/browse/LUCENE-3176 Project: Lucene - Java Issue Type: Bug Environment: trunk Reporter: Robert Muir Assignee: Michael McCandless hit a fail in TestNRTThreads running tests over and over: -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1431: --- Fix Version/s: (was: 3.2) Priority: Major (was: Trivial) Affects Version/s: (was: 1.4) 4.0 CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch Original Estimate: 24h Remaining Estimate: 24h We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1431: --- Remaining Estimate: (was: 24h) Original Estimate: (was: 24h) CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1431: --- Attachment: SOLR-1431.patch Here's a patch updated to trunk. CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated SOLR-1431: --- Attachment: SOLR-1431.patch Methods moved up into abstract class ShardHandler. All tests pass. CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1431) CommComponent abstracted
[ https://issues.apache.org/jira/browse/SOLR-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042866#comment-13042866 ] Jason Rutherglen commented on SOLR-1431: No worries mate! CommComponent abstracted Key: SOLR-1431 URL: https://issues.apache.org/jira/browse/SOLR-1431 Project: Solr Issue Type: Improvement Components: search Affects Versions: 4.0 Reporter: Jason Rutherglen Assignee: Noble Paul Attachments: SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch, SOLR-1431.patch We'll abstract CommComponent in this issue. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2563) Allow generic pluggable file system implementations
[ https://issues.apache.org/jira/browse/SOLR-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042870#comment-13042870 ] Jason Rutherglen commented on SOLR-2563: One way to test this out would be to create a Solr unit test that tries to create a Solr instance on top of HDFS using an HDFSSolrResourceLoader. Then I think the problem areas would reveal themselves. It would be nice to run all of the Solr unit tests this way, however that seems much more complex. Allow generic pluggable file system implementations --- Key: SOLR-2563 URL: https://issues.apache.org/jira/browse/SOLR-2563 Project: Solr Issue Type: New Feature Components: update Affects Versions: 4.0 Reporter: Jason Rutherglen For things like configuration files, they can be loaded from places other than the local filesystem, such as Zookeeper or HDFS. In this issue I will abstract that functionality out. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
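The abstraction described in SOLR-2563, loading configuration files from somewhere other than the local filesystem, could look roughly like the sketch below. Every name here (`PluggableConfigSketch`, `ConfigSource`, `LocalConfigSource`, `InMemoryConfigSource`, `readAll`) is hypothetical and not taken from Solr or from any patch on the issue. The in-memory implementation merely stands in for a remote store such as ZooKeeper or HDFS, since the point is that callers depend only on the interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Solr.
public class PluggableConfigSketch {

    /** Anything that can open a named configuration resource. */
    interface ConfigSource {
        InputStream open(String name) throws IOException;
    }

    /** Reads resources from a directory on the local filesystem. */
    static class LocalConfigSource implements ConfigSource {
        private final Path baseDir;

        LocalConfigSource(Path baseDir) {
            this.baseDir = baseDir;
        }

        @Override
        public InputStream open(String name) throws IOException {
            return Files.newInputStream(baseDir.resolve(name));
        }
    }

    /** Stand-in for a non-local store; keeps resources in a map. */
    static class InMemoryConfigSource implements ConfigSource {
        private final Map<String, byte[]> files = new HashMap<>();

        void put(String name, String content) {
            files.put(name, content.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public InputStream open(String name) throws IOException {
            byte[] data = files.get(name);
            if (data == null) {
                throw new IOException("No such resource: " + name);
            }
            return new ByteArrayInputStream(data);
        }
    }

    /** Reads a whole resource as UTF-8 text, whichever source it comes from. */
    static String readAll(ConfigSource source, String name) {
        try (InputStream in = source.open(name)) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InMemoryConfigSource source = new InMemoryConfigSource();
        source.put("solrconfig.xml", "<config/>");
        System.out.println(readAll(source, "solrconfig.xml"));
    }
}
```

A unit test of the kind suggested in the comment above would then swap in a different `ConfigSource` and see which code paths still assume local files.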
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13041995#comment-13041995 ] Jason Rutherglen commented on SOLR-2193: I'm curious if someone who doesn't work at Lucid can be involved in Solr design discussions. In any case, please autocratically continue. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042004#comment-13042004 ] Jason Rutherglen commented on SOLR-2193: This article is an indicator of the types of benchmarks to perform: http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042011#comment-13042011 ] Jason Rutherglen commented on SOLR-2193: bq. Jason, this issue isn't intended to solve NRT What is this line doing? {code} newReader = currentReader.reopen(indexWriterProvider.getIndexWriter(), true); {code} Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042013#comment-13042013 ] Jason Rutherglen commented on SOLR-2193: Also: https://issues.apache.org/jira/browse/SOLR-2193?focusedCommentId=13016875page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13016875 And this comment: {quote} Can you elaborate on why you don't think it's implementing NRT? I've tested basic indexing/searching using wikipedia documents at about 50-100 documents a second, opening a new reader every second. That felt pretty near-real-time to me, but the phrase is subjective. {quote} from: https://issues.apache.org/jira/browse/SOLR-2193?focusedCommentId=13041268page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13041268 Robert, your statement's confusing. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. 
[jira] [Created] (SOLR-2569) Enable facile moving of cores
Enable facile moving of cores - Key: SOLR-2569 URL: https://issues.apache.org/jira/browse/SOLR-2569 Project: Solr Issue Type: Improvement Components: multicore, replication (java) Affects Versions: 4.0 Reporter: Jason Rutherglen Spin-off from this thread: http://search-lucene.com/m/5CO7Z1oOrh6/elastic+searchsubj=Solr+vs+ElasticSearch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042259#comment-13042259 ] Jason Rutherglen commented on SOLR-2193: Simon, thanks for opening new issues. Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Robert Muir Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch The update handler needs an overhaul. A few goals I think we might want to look at: 1. Cleanup - drop DirectUpdateHandler(2) line - move to something like UpdateHandler, DefaultUpdateHandler 2. Expose the SolrIndexWriter in the api or add the proper abstractions to get done what we now do with special casing: if (directupdatehandler2) success else failish 3. Stop closing the IndexWriter and start using commit (still lazy IW init though). 4. Drop iwAccess, iwCommit locks and sync mostly at the Lucene level. 5. Keep NRT support in mind. 6. Keep microsharding in mind (maintain logical index as multiple physical indexes) 7. Address the current issues we face because multiple original/'reloaded' cores can have a different IndexWriter on the same index. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041464#comment-13041464 ]
Jason Rutherglen commented on SOLR-2193: bq. I enjoyed our dialogue honestly I'd prefer to simply get things done rather than banter with no results.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041691#comment-13041691 ]
Jason Rutherglen commented on SOLR-2193: As previously suggested, we need a new issue that refactors IndexWriter into SolrCore, instead of placing it into an UpdateHandler. Then we can iterate on re/factoring the NRT functionality.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041693#comment-13041693 ]
Jason Rutherglen commented on SOLR-2193: {quote}this is a fundamentally wrong direction{quote} Yes. The idea of adding NRT is good though.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Created] (SOLR-2563) Allow generic pluggable file system implementations
Allow generic pluggable file system implementations --- Key: SOLR-2563 URL: https://issues.apache.org/jira/browse/SOLR-2563 Project: Solr Issue Type: New Feature Components: update Affects Versions: 4.0 Reporter: Jason Rutherglen
Things like configuration files can be loaded from places other than the local filesystem, such as ZooKeeper or HDFS. In this issue I will abstract that functionality out.
[jira] [Commented] (SOLR-2563) Allow generic pluggable file system implementations
[ https://issues.apache.org/jira/browse/SOLR-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041713#comment-13041713 ]
Jason Rutherglen commented on SOLR-2563: Uwe, thanks, I think though there was an issue even trying to use that. I'll take a look and report back!
Allow generic pluggable file system implementations --- Key: SOLR-2563 URL: https://issues.apache.org/jira/browse/SOLR-2563 Project: Solr Issue Type: New Feature Components: update Affects Versions: 4.0 Reporter: Jason Rutherglen
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041715#comment-13041715 ]
Jason Rutherglen commented on SOLR-2193: {quote}I haven't looked that closely at this patch yet, but it already fixes a long standing problem in Solr, that a long running merge blocks a Solr commit, because it switches to IW.commit instead of closing/opening the writer.{quote} Yes, that is/was not clear in the issue. Thank you for spelling it out. However, I think the patch is creating new abstract classes that would then go away? Why not spend a little more time on a more complete overall design for future refactoring?
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041767#comment-13041767 ]
Jason Rutherglen commented on SOLR-2193: {quote}solving our rather nasty reload a core, briefly different writers on the same index problem (usually avoided because the overlap is brief and the IndexWriter created lazily).{quote} Robert, I fully agree, however then the title of the Jira is incorrect. Also the whole ref counted thing in Solr:
{code}
RefCounted<SolrIndexSearcher> holder = core.getNewestSearcher(false);
SolrIndexSearcher s = holder.get();
holder.decref();
// since there could be two commits in a row, don't test for a specific new searcher
// just test that the old one has been replaced.
{code}
Should not be needed anymore. We're also adding ref counting on IWs now as well? All of this is unnecessary. If we're modularizing, this isn't the right path to go on.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
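For readers unfamiliar with the pattern being debated in the comment above, here is a minimal sketch of a reference-counted holder. It is not Solr's RefCounted class: the name RefCountedHolder, the release callback, and the refCount accessor are hypothetical and only illustrate why every borrow has to be paired with a decref.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a reference-counted holder; not Solr's actual RefCounted class.
class RefCountedHolder<T> {
    private final T resource;
    private final Runnable onRelease;   // runs once, when the count reaches zero
    private final AtomicInteger refs = new AtomicInteger(1);   // the creator holds one reference

    RefCountedHolder(T resource, Runnable onRelease) {
        this.resource = resource;
        this.onRelease = onRelease;
    }

    // Take an additional reference before handing the resource to another user.
    RefCountedHolder<T> incref() {
        refs.incrementAndGet();
        return this;
    }

    T get() {
        return resource;
    }

    // Drop one reference; the caller that drops the last one triggers the release.
    void decref() {
        if (refs.decrementAndGet() == 0) {
            onRelease.run();
        }
    }

    int refCount() {
        return refs.get();
    }
}

class RefCountDemo {
    public static void main(String[] args) {
        boolean[] closed = {false};
        RefCountedHolder<String> holder =
            new RefCountedHolder<>("searcher-1", () -> closed[0] = true);

        holder.incref();   // a query borrows the searcher
        holder.decref();   // the query is done
        System.out.println("closed after query: " + closed[0]);

        holder.decref();   // the owner releases its own reference
        System.out.println("closed after owner release: " + closed[0]);
    }
}
```

A missed decref keeps the resource open indefinitely, which is one way to read the "prone to pile up" concern raised elsewhere in this thread.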
[jira] [Commented] (SOLR-2563) Allow generic pluggable file system implementations
[ https://issues.apache.org/jira/browse/SOLR-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041924#comment-13041924 ]
Jason Rutherglen commented on SOLR-2563: I don't think CoreContainer is completely decoupled from the local file system. Check out persist, persistFile, etc. Those should either be turned off, or should write to the underlying generic file system. It looks like libs are hard coded in CoreContainer?
{code}
if (libDir != null) {
  File f = FileUtils.resolvePath(new File(dir), libDir);
  log.info("loading shared library: " + f.getAbsolutePath());
  libLoader = SolrResourceLoader.createClassLoader(f, null);
}
{code}
CoreDescriptor.getDataDir() is ambiguous. QueryElevationComponent is hardcoded:
{code}
// check if using ZooKeeper
ZkController zkController = core.getCoreDescriptor().getCoreContainer().getZkController();
if(zkController != null) {
{code}
IndexBasedSpellChecker.initSourceReader()
SolrIndexWriter hardcodes writing the infoStream to the local file system. The benchmark code does as well; however, that's somewhat less of a priority.
Allow generic pluggable file system implementations --- Key: SOLR-2563 URL: https://issues.apache.org/jira/browse/SOLR-2563 Project: Solr Issue Type: New Feature Components: update Affects Versions: 4.0 Reporter: Jason Rutherglen
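The kind of abstraction this issue asks for might look roughly like the sketch below. Everything in it is an assumption made for illustration: ConfigSource, LocalConfigSource, InMemoryConfigSource and ConfigLoader are hypothetical names, not Solr classes, and the in-memory implementation merely stands in for a remote store such as ZooKeeper or HDFS.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over "where configuration files live"; not a Solr interface.
interface ConfigSource {
    InputStream open(String name) throws IOException;
}

// Reads configuration files from a directory on the local file system.
class LocalConfigSource implements ConfigSource {
    private final Path baseDir;

    LocalConfigSource(Path baseDir) {
        this.baseDir = baseDir;
    }

    @Override
    public InputStream open(String name) throws IOException {
        return Files.newInputStream(baseDir.resolve(name));
    }
}

// Stand-in for a remote store (e.g. ZooKeeper or HDFS), kept in memory for illustration.
class InMemoryConfigSource implements ConfigSource {
    private final Map<String, byte[]> files = new HashMap<>();

    void put(String name, String content) {
        files.put(name, content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public InputStream open(String name) throws IOException {
        byte[] data = files.get(name);
        if (data == null) throw new IOException("no such config: " + name);
        return new ByteArrayInputStream(data);
    }
}

// Callers depend only on ConfigSource, so the backing store can be swapped.
final class ConfigLoader {
    static String readAll(ConfigSource source, String name) {
        try (InputStream in = source.open(name)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class ConfigSourceDemo {
    public static void main(String[] args) {
        InMemoryConfigSource source = new InMemoryConfigSource();
        source.put("solrconfig.xml", "<config/>");
        System.out.println(ConfigLoader.readAll(source, "solrconfig.xml"));
    }
}
```

The hard-coded File usages listed in the comment are the places where such an interface would have to be threaded through.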
[jira] [Commented] (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041026#comment-13041026 ]
Jason Rutherglen commented on SOLR-1395: I think John Wu brings up excellent points. I don't think Solr Cloud offers the same thing as this issue, and/or it's not articulated well on the wiki. Lucene out of the box doesn't offer facets and other search component features. These are things Solr provides but could/should be modularized out as already proposed. Solr is currently too tightly interwoven, which is perhaps why this patch is challenging to operate. Integrating alternative systems into Solr seems to be political from my point of view, e.g., <political>Solr + Katta</political>
Integrate Katta --- Key: SOLR-1395 URL: https://issues.apache.org/jira/browse/SOLR-1395 Project: Solr Issue Type: New Feature Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.2 Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, katta-solrcores.jpg, katta.node.properties, katta.zk.properties, log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, solr-1395-katta-0.6.2.patch, solr1395.jpg, test-katta-core-0.6-dev.jar, zkclient-0.1-dev.jar, zookeeper-3.2.1.jar Original Estimate: 336h Remaining Estimate: 336h
We'll integrate Katta into Solr so that:
* Distributed search uses Hadoop RPC
* Shard/SolrCore distribution and management
* Zookeeper based failover
* Indexes may be built using Hadoop
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041265#comment-13041265 ]
Jason Rutherglen commented on SOLR-2193: I think the Solr ref counting code should go/exit, it's prone to pile up. Instead, as with Twitter's system, a new reader is opened per query, because the readers are lightweight enough. I think that's a better path to pursue than monkey wrenching Solr's existing system which, from the ground up, is not designed for NRT. If this patch isn't implementing NRT, what is the point?
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041276#comment-13041276 ]
Jason Rutherglen commented on SOLR-2193:
bq. This patch certainly won't complete the NRT work needed
Mark, I was reading this comment.
bq. You are questioning my whole patch
I think it'll be easier to add what's needed for this patch into Lucene rather than retrofit Solr. I mentioned this a while back, however there was pushback on re-architecting Solr. Making everything per-segment would be much more productive than allowing NRT at this stage. Ah, I think you're simply trying to avoid the stop-the-world behavior Solr has right now? If so, that should be more prominent in the Jira.
bq. IndexWriter writer = ((DirectUpdateHandler2)core.getUpdateHandler()).getIndexWriterProvider().getIndexWriter();
Ugly Solr style code?! The commit in X time can be a simple contrib class for Lucene. It doesn't need to be Solr specific. Anyways, I tried to do this 2 years ago for NRT; there was pushback on just getting the IndexWriter from the update handler, like the above code does. <political>Wow</political>
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041399#comment-13041399 ]
Jason Rutherglen commented on SOLR-2193:
bq. IndexWriter writer =((DirectUpdateHandler2)core.getUpdateHandler()).getIndexWriterProvider().getIndexWriter();
Why isn't IW a part of SolrCore? It's the main class running the show. How can there be a Solr core without an IW? I think IW never gets closed until the SolrCore is closed. The next move would be to place all of the caches at the segment level. It's been clear for quite a while that you folks at Lucid are trying to protect your golden goose, e.g., Solr, from changing much unless dictated by your staff or a paying customer. I think in politics those are called bribes? Hence a large part of the recent fracas regarding modularizing the goose, whose 'resolution' has resulted in no changes. It's astonishing that changes which are OK for Solr from some people are not OK from others. This is not a meritocracy. If you insist on driving, you should incorporate some of the feedback given. Solr was hacked together from the beginning and this is yet another ugly retrofit that is being steamrolled in. If you're confident in your abilities, you're confident enough to make major changes. I've never seen that on the Solr side of the Lucene project.
bq. I remember that issue - I tried to make some comments to help you out with it
No, there was push back on something silly and simple, e.g., getting the IW from the UpdateHandler, just as you have done here. What is the point in contributing when contributions are blocked for no reason?
bq. SOLR-1155
What happened to this poor guy's patch? Nothing.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041414#comment-13041414 ]
Jason Rutherglen commented on SOLR-2193: Mark, That's an odd non-technical answer, and in the meritocracy of comedy, not funny either.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041422#comment-13041422 ]
Jason Rutherglen commented on SOLR-2193: Mark, I think you're missing the point. If you're a committer then it's implied you review patches and interact with the community, nicely. That's not happening in this issue, or in Solr, as noted by in fact many people.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041428#comment-13041428 ]
Jason Rutherglen commented on SOLR-2193: -1 on the patch, I just reviewed again. IndexWriter should be a part of SolrCore (IW is canonical), as we should not be opening and closing IWs in the life of a Solr core.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (SOLR-2193) Re-architect Update Handler
[ https://issues.apache.org/jira/browse/SOLR-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041433#comment-13041433 ]
Jason Rutherglen commented on SOLR-2193:
bq. Okay, -1 accepted. You win, good fight
Mark this was no fight, this is the open source Apache way.
Re-architect Update Handler --- Key: SOLR-2193 URL: https://issues.apache.org/jira/browse/SOLR-2193 Project: Solr Issue Type: Improvement Reporter: Mark Miller Fix For: 4.0 Attachments: SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch, SOLR-2193.patch
[jira] [Commented] (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040845#comment-13040845 ]
Jason Rutherglen commented on LUCENE-2793: I already posted a patch to this issue a while back: https://issues.apache.org/jira/secure/attachment/12468030/LUCENE-2793.patch It seems we're looping here.
Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: core/store Reporter: Michael McCandless Assignee: Simon Willnauer Labels: gsoc2011, lucene-gsoc-11, mentor Attachments: LUCENE-2793.patch, LUCENE-2793.patch
Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible.
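The proposal in the description above can be sketched as a small value class. This is a hedged illustration only, not the IOContext class from any patch on this issue: the names IOContextSketch and DirectorySketch, the factory methods, and the buffer sizes (1024 and 4096 bytes) are all assumptions chosen for the example.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the IOContext idea; not the class from the attached patches.
final class IOContextSketch {
    enum Flag { DIRECT, SEQUENTIAL }

    final int readBufferSize;
    final Set<Flag> flags;

    private IOContextSketch(int readBufferSize, EnumSet<Flag> flags) {
        this.readBufferSize = readBufferSize;
        this.flags = Collections.unmodifiableSet(flags);
    }

    // Searching: small buffer, no special flags (sizes are illustrative assumptions).
    static IOContextSketch forSearch() {
        return new IOContextSketch(1024, EnumSet.noneOf(Flag.class));
    }

    // Merging: larger buffer, bypass the OS cache and hint sequential access.
    static IOContextSketch forMerge() {
        return new IOContextSketch(4096, EnumSet.of(Flag.DIRECT, Flag.SEQUENTIAL));
    }

    boolean has(Flag flag) {
        return flags.contains(flag);
    }
}

// Hypothetical directory interface showing where the context would be passed in.
interface DirectorySketch {
    String openInput(String fileName, IOContextSketch context);
}

class IOContextDemo {
    public static void main(String[] args) {
        // A stand-in directory that only reports how it would open the file.
        DirectorySketch dir = (name, ctx) ->
            name + " buffer=" + ctx.readBufferSize + " flags=" + ctx.flags;

        System.out.println(dir.openInput("_0.frq", IOContextSketch.forSearch()));
        System.out.println(dir.openInput("_0.frq", IOContextSketch.forMerge()));
    }
}
```

Because the context travels with each open call, a reader opened for merging and one opened for searching end up with different file handles, which is the pooling concern the description raises.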
[jira] [Commented] (LUCENE-3112) Add IW.add/updateDocuments to support nested documents
[ https://issues.apache.org/jira/browse/LUCENE-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034734#comment-13034734 ] Jason Rutherglen commented on LUCENE-3112: -- I think perhaps like a Hadoop input format split, we can define meta-data at the segment level as to where the documents live so that if one is 'splitting' the index, as is being implemented with HBase, the 'splitter' can be 'smart'. Add IW.add/updateDocuments to support nested documents -- Key: LUCENE-3112 URL: https://issues.apache.org/jira/browse/LUCENE-3112 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3112.patch I think nested documents (LUCENE-2454) is a very compelling addition to Lucene. It's also a popular (many votes) issue. Beyond supporting nested document querying, which is already an incredible addition since it preserves the relational model on indexing normalized content (eg, DB tables, XML docs), LUCENE-2454 should also enable speedups in grouping implementation when you group by a nested field. For the same reason, it can also enable very fast post-group facet counting impl (LUCENE-3097) when you what to count(distinct(nestedField)), instead of unique documents, as your identifier. I expect many apps that use faceting need this ability (to count(distinct(nestedField)) not distinct(docID)). To support these use cases, I believe the only core change needed is the ability to atomically add or update multiple documents, which you cannot do today since in between add/updateDocument calls a flush (eg due to commit or getReader()) could occur. This new API (addDocuments(IterableDocument), updateDocuments(Term delTerm, IterableDocument) would also further guarantee that the documents are assigned sequential docIDs in the order the iterator provided them, and that the docIDs all reside in one segment. 
Segment merging never splits segments apart, so this invariant would hold even as merges/optimizes take place. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
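A minimal sketch of the guarantee described above, under loud assumptions: ToyBlockWriter is a hypothetical stand-in and is not Lucene's IndexWriter. It only illustrates why adding a whole block under one lock yields contiguous docIDs in iteration order, which separate addDocument calls cannot promise when another thread (or a flush) may interleave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy stand-in for an index writer (hypothetical; not Lucene's IndexWriter). */
public class ToyBlockWriter {
    // One list plays the role of a single in-memory segment.
    private final List<String> docs = new ArrayList<>();

    /** Adds one document and returns the docID assigned to it. */
    public synchronized int addDocument(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    /**
     * Adds every document of the block while holding the lock, so no other
     * add can interleave. The docIDs are sequential, in iteration order.
     * Returns the docID of the first document in the block.
     */
    public synchronized int addDocuments(Iterable<String> block) {
        int first = docs.size();
        for (String doc : block) {
            docs.add(doc);
        }
        return first;
    }

    public synchronized String get(int docId) {
        return docs.get(docId);
    }

    public synchronized int maxDoc() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyBlockWriter writer = new ToyBlockWriter();
        writer.addDocument("unrelated");
        int first = writer.addDocuments(Arrays.asList("child-1", "child-2", "parent"));
        System.out.println("block occupies docIDs " + first + ".." + (writer.maxDoc() - 1));
    }
}
```

The toy keeps everything in one list, so the "same segment" part of the guarantee is trivially true here; in a real writer it would depend on no flush happening in the middle of the block.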
[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs
[ https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019823#comment-13019823 ] Jason Rutherglen commented on LUCENE-2956: -- {quote}Jason I think nothing prevents you from start working on this again Yet, I think we should freeze the branch now and only allow merging, bug fixes, tests and documentation fixes until we land on trunk. Once we are there we can freely push stuff in the branch again and make it work with seq. ids. {quote} OK, great. I remember now that our main concern was the memory usage of using a short[] (for the seq ids) if the total number of documents is large (eg, 10s of millions). Also at some point we'd have double the memory usage when we roll over to the next set, until the previous readers are closed. bq. I think we should freeze the branch now and only allow merging, bug fixes, tests and documentation fixes until we land on trunk Maybe once LUCENE-2312 sequence ids work for deletes, we can look at creating a separate branch that implements seq id deletes for all segments, and compare with the BV approach. Eg, performance, memory usage, and simplicity. Support updateDocument() with DWPTs --- Key: LUCENE-2956 URL: https://issues.apache.org/jira/browse/LUCENE-2956 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Realtime Branch Reporter: Michael Busch Assignee: Simon Willnauer Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2956.patch, LUCENE-2956.patch With separate DocumentsWriterPerThreads (DWPT) it can currently happen that the delete part of an updateDocument() is flushed and committed separately from the corresponding new document. We need to make sure that updateDocument() is always an atomic operation from an IW.commit() and IW.getReader() perspective. See LUCENE-2324 for more details.
[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs
[ https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019370#comment-13019370 ] Jason Rutherglen commented on LUCENE-2956: -- Simon, nice work. I agree with Michael B. that the deletes are super complex. We had discussed using sequence ids for all segments (not just the RT enabled DWPT ones); however, we never worked out a specification, eg, for things like wrap around if a primitive short[] was used. Shall we start again on LUCENE-2312? I think we still need/want to use sequence ids there. The RT DWPTs shouldn't have so many documents that using a long[] for the sequence ids is too RAM consuming? Support updateDocument() with DWPTs --- Key: LUCENE-2956 URL: https://issues.apache.org/jira/browse/LUCENE-2956 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Realtime Branch Reporter: Michael Busch Assignee: Simon Willnauer Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2956.patch, LUCENE-2956.patch With separate DocumentsWriterPerThreads (DWPT) it can currently happen that the delete part of an updateDocument() is flushed and committed separately from the corresponding new document. We need to make sure that updateDocument() is always an atomic operation from an IW.commit() and IW.getReader() perspective. See LUCENE-2324 for more details.
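A rough sketch of how per-document sequence ids could drive deletes, offered as one possible reading of the idea discussed above and not as the realtime branch's code. The class SeqIdDeletes, its method names, and the rule "a reader sees every delete whose sequence id is smaller than its own" are all assumptions made for illustration. The ramBytes helper just multiplies out the array cost the comments worry about: 8 bytes per document for a long[] versus 2 for a short[], where a short[] can only represent 65,536 distinct ids before wrapping around.

```java
import java.util.Arrays;

/** Hypothetical sequence-id delete tracker (illustration only). */
public class SeqIdDeletes {
    private static final long NOT_DELETED = Long.MAX_VALUE;

    // deleteSeqId[doc] is the sequence id of the delete that removed the doc.
    private final long[] deleteSeqId;
    private long nextSeqId = 0;

    public SeqIdDeletes(int maxDoc) {
        deleteSeqId = new long[maxDoc];
        Arrays.fill(deleteSeqId, NOT_DELETED);
    }

    /** Records a delete and returns the sequence id assigned to it. */
    public synchronized long delete(int docId) {
        long seq = nextSeqId++;
        if (deleteSeqId[docId] == NOT_DELETED) {
            deleteSeqId[docId] = seq;
        }
        return seq;
    }

    /** A reader opened now carries this id and sees all deletes with smaller ids. */
    public synchronized long openReader() {
        return nextSeqId;
    }

    /** True if the delete of docId happened before the given reader was opened. */
    public synchronized boolean isDeleted(int docId, long readerSeqId) {
        return deleteSeqId[docId] < readerSeqId;
    }

    /** Raw array payload in bytes, ignoring object headers. */
    public static long ramBytes(int maxDoc, int bytesPerEntry) {
        return (long) maxDoc * bytesPerEntry;
    }

    public static void main(String[] args) {
        System.out.println("long[]  for 10M docs: " + ramBytes(10_000_000, Long.BYTES) + " bytes");
        System.out.println("short[] for 10M docs: " + ramBytes(10_000_000, Short.BYTES) + " bytes");
    }
}
```

With this layout an older reader keeps its point-in-time view without a private copy of the deletes, at the price of the larger array; wrap-around handling for a narrower type is deliberately left out, since the thread says no specification was worked out.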
[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019391#comment-13019391 ] Jason Rutherglen commented on LUCENE-2312: -- In the current patch, I'm copying the parallel array for the end of a term's postings per reader [re]open. However in the case where we're opening a reader after each document is indexed, this is wasteful. We can simply queue the term ids from the last indexed document, and only copy the newly updated values over to the consistent, read-only parallel array. Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: Realtime Branch Reporter: Jason Rutherglen Assignee: Michael Busch Fix For: Realtime Branch Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch In order to offer users near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
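The queueing idea in the LUCENE-2312 comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch: IncrementalPostingsEnds, its fields, and its methods are hypothetical names, the sketch is single-threaded, and it ignores what happens to readers that still hold the previous snapshot.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch: copy only the entries touched since the last reopen. */
public class IncrementalPostingsEnds {
    private final int[] writeEnds;  // updated by the indexing side
    private final int[] readEnds;   // consistent view handed to readers
    private final ArrayDeque<Integer> dirtyTermIds = new ArrayDeque<>();

    public IncrementalPostingsEnds(int numTerms) {
        writeEnds = new int[numTerms];
        readEnds = new int[numTerms];
    }

    /** Called while indexing: records the new end of a term's postings. */
    public void onTermIndexed(int termId, int newEnd) {
        writeEnds[termId] = newEnd;
        dirtyTermIds.add(termId);
    }

    /** Reopen by copying the whole array; returns the number of entries copied. */
    public int fullCopy() {
        System.arraycopy(writeEnds, 0, readEnds, 0, writeEnds.length);
        dirtyTermIds.clear();
        return writeEnds.length;
    }

    /** Reopen by copying only queued term ids; returns the number of entries copied. */
    public int incrementalCopy() {
        int copied = 0;
        while (!dirtyTermIds.isEmpty()) {
            int termId = dirtyTermIds.poll();
            readEnds[termId] = writeEnds[termId];
            copied++;
        }
        return copied;
    }

    public int readEnd(int termId) {
        return readEnds[termId];
    }
}
```

When a reader is opened after every document, the incremental path touches only as many entries as that document had terms, instead of one entry per unique term in the buffer.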
[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs
[ https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018645#comment-13018645 ] Jason Rutherglen commented on LUCENE-2956: -- I think I have an idea, however can you explain the ticketQueue? Support updateDocument() with DWPTs --- Key: LUCENE-2956 URL: https://issues.apache.org/jira/browse/LUCENE-2956 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Realtime Branch Reporter: Michael Busch Assignee: Simon Willnauer Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2956.patch With separate DocumentsWriterPerThreads (DWPT) it can currently happen that the delete part of an updateDocument() is flushed and committed separately from the corresponding new document. We need to make sure that updateDocument() is always an atomic operation from an IW.commit() and IW.getReader() perspective. See LUCENE-2324 for more details.
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017886#comment-13017886 ] Jason Rutherglen commented on LUCENE-2186: -- bq. changing this to a random access seekable API should be not too hard I think we can offer the option of MMap'ing the field caches, which I think will help alleviate OOMs? First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. 
This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage & GC load of this index values API should be much better than field cache, since it does not create an object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring.
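To make the "deref" variant and the shared-array point above more concrete, here is a small sketch. It is not the patch's oal.index.values code: DerefBytesSketch and its layout (one shared byte[] pool of unique UTF8 values, a start-offset array, and one int per document) are assumptions chosen to show why no per-document object is needed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical deref-style storage: unique values stored once, docs point at them. */
public class DerefBytesSketch {
    private final byte[] pool;       // all unique values, concatenated as UTF8
    private final int[] starts;      // starts[i]..starts[i+1] delimits value i
    private final int[] docToValue;  // per document: index of its value

    public DerefBytesSketch(String[] valuePerDoc) {
        Map<String, Integer> seen = new HashMap<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<Integer> startList = new ArrayList<>();
        docToValue = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            String value = valuePerDoc[doc];
            Integer index = seen.get(value);
            if (index == null) {
                // First time we see this value: append its bytes to the pool.
                index = seen.size();
                seen.put(value, index);
                startList.add(out.size());
                byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
            }
            docToValue[doc] = index;
        }
        startList.add(out.size());
        pool = out.toByteArray();
        starts = new int[startList.size()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = startList.get(i);
        }
    }

    /** Random access by docID, decoding the value from the shared pool. */
    public String get(int docId) {
        int v = docToValue[docId];
        return new String(pool, starts[v], starts[v + 1] - starts[v], StandardCharsets.UTF_8);
    }

    public int poolSizeInBytes() {
        return pool.length;
    }
}
```

The HashMap is only used while building; the read side consists of three primitive arrays, which is the property the description credits for lower RAM usage and GC load.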
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017679#comment-13017679 ] Jason Rutherglen commented on LUCENE-2186: -- I'm wondering if there is a limitation on whether or not we can randomly access the doc values from the underlying Directory implementation, rather than need to load all the values directly into the main heap space. This seems doable, and if so let me know if I can provide a patch. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). 
This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage & GC load of this index values API should be much better than field cache, since it does not create an object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring.
[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs
[ https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017084#comment-13017084 ] Jason Rutherglen commented on LUCENE-2956: -- What is the status of this one? If no one's working on it, I can take a stab. Support updateDocument() with DWPTs --- Key: LUCENE-2956 URL: https://issues.apache.org/jira/browse/LUCENE-2956 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: Realtime Branch Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch With separate DocumentsWriterPerThreads (DWPT) it can currently happen that the delete part of an updateDocument() is flushed and committed separately from the corresponding new document. We need to make sure that updateDocument() is always an atomic operation from an IW.commit() and IW.getReader() perspective. See LUCENE-2324 for more details.
[jira] [Commented] (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks
[ https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014016#comment-13014016 ] Jason Rutherglen commented on LUCENE-2573: -- bq. influenced due to the fact that flushing is very very CPU intensive Do you think this is due mostly to the vint decoding? We're not interleaving postings on flush with this patch so the CPU consumption should be somewhat lower. bq. At the same time CMS might kick in way more often since we are writing more segments which are also smaller compared to trunk This is probably the more likely case. In general, we may be able to default to a higher overall RAM buffer size, and perhaps there won't be degradation in indexing performance like there is with trunk? In the future with RT we could get fancy and selectively merge segments as we're flushing, if writing larger segments is important. I'd personally prefer to write out 1-2 GB segments, and limit the number of DWPTs to 2-3, mainly for servers that are concurrently indexing and searching (eg, the RT use case). I think the current default number of thread states is a bit high. Tiered flushing of DWPTs by RAM with low/high water marks - Key: LUCENE-2573 URL: https://issues.apache.org/jira/browse/LUCENE-2573 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Simon Willnauer Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch Now that we have DocumentsWriterPerThreads we need to track total consumed RAM across all DWPTs. A flushing strategy idea that was discussed in LUCENE-2324 was to use a tiered approach: - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM) - Flush all DWPTs at a high water mark (e.g. at 110%) - Use linear steps in between high and low watermark: E.g. 
when 5 DWPTs are used, flush at 90%, 95%, 100%, 105% and 110%. Should we allow the user to configure the low and high water mark values explicitly using total values (e.g. low water mark at 120MB, high water mark at 140MB)? Or shall we keep for simplicity the single setRAMBufferSizeMB() config method and use something like 90% and 110% for the water marks?
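The linear steps between the low and high water marks can be worked out with a few lines of arithmetic. The sketch below is only an illustration of that calculation: TieredFlushThresholds and both method names are hypothetical, and the water marks are expressed as fractions of the configured RAM buffer (0.90 and 1.10 for the example in the description).

```java
/** Hypothetical helper computing tiered flush thresholds (illustration only). */
public class TieredFlushThresholds {

    /**
     * Returns one threshold per DWPT, spaced linearly from low to high.
     * The i-th DWPT is flushed once RAM use reaches thresholds[i].
     */
    public static double[] thresholds(int numDwpts, double low, double high) {
        if (numDwpts < 1) {
            throw new IllegalArgumentException("need at least one DWPT");
        }
        double[] result = new double[numDwpts];
        if (numDwpts == 1) {
            result[0] = low;
            return result;
        }
        double step = (high - low) / (numDwpts - 1);
        for (int i = 0; i < numDwpts; i++) {
            result[i] = low + i * step;
        }
        return result;
    }

    /** Counts how many DWPTs should have been flushed at the given RAM use. */
    public static int dwptsToFlush(double usedFraction, double[] thresholds) {
        int count = 0;
        for (double t : thresholds) {
            if (usedFraction >= t) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (double t : thresholds(5, 0.90, 1.10)) {
            System.out.println(String.format("%.0f%%", t * 100));
        }
    }
}
```

For absolute water marks such as 120MB and 140MB, the same function applies with low and high given in megabytes instead of fractions.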
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012659#comment-13012659 ] Jason Rutherglen commented on LUCENE-3003: -- {quote}Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting{quote} I'd guess that pulsing should be 'good enough' most of the time? It seems like there'll be some overlap in terms of the gains from pulsing vis-à-vis DocValues? Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch Solr's UnInvertedField lets you quickly look up all term ords for a given doc/field. Like FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord -> term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. 
If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting
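The faceting pattern described in this issue (count with int ords, resolve only the top N ords to text at the end) can be sketched as below. This is not UnInvertedField: OrdFacetCounter is a hypothetical class, the int[][] stands in for the per-document ord structure, and the String[] lookup stands in for resolving an ord through TermsEnum.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical ord-based facet counter (illustration only). */
public class OrdFacetCounter {

    /**
     * docOrds[doc] holds the term ords of that document (multi-valued),
     * matchingDocs lists the docIDs that matched the query, and
     * terms[ord] is the text of each ord. Returns "term=count" for the
     * topN most frequent terms; ties are broken by lower ord first.
     */
    public static List<String> topTerms(int[][] docOrds, int[] matchingDocs,
                                        String[] terms, int topN) {
        // Counting touches only ints, never the term bytes.
        int[] counts = new int[terms.length];
        for (int doc : matchingDocs) {
            for (int ord : docOrds[doc]) {
                counts[ord]++;
            }
        }

        List<Integer> ords = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                ords.add(ord);
            }
        }
        ords.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));

        // Only the winners are resolved from ord to text.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, ords.size()); i++) {
            int ord = ords.get(i);
            result.add(terms[ord] + "=" + counts[ord]);
        }
        return result;
    }
}
```

A full sort is used here for brevity; a bounded priority queue would be the usual choice when the number of distinct terms is large.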
[jira] [Commented] (LUCENE-3003) Move UnInvertedField into Lucene core
[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012687#comment-13012687 ] Jason Rutherglen commented on LUCENE-3003: -- bq. Ie Pulsing is good for terms that have only 1 or 2 docs I thought the default is 16 docs? If there are more, then seek'ing to the postings should be negligible (in comparison to a larger aggregate index size when using CSF/DocValues, which'll consume more of the system IO cache)? Move UnInvertedField into Lucene core - Key: LUCENE-3003 URL: https://issues.apache.org/jira/browse/LUCENE-3003 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3003.patch Solr's UnInvertedField lets you quickly look up all term ords for a given doc/field. Like FieldCache, it inverts the index to produce this, and creates a RAM-resident data structure holding the bits; but, unlike FieldCache, it can handle multiple values per doc, and it does not hold the term bytes in RAM. Rather, it holds only term ords, and then uses TermsEnum to resolve ord -> term. This is great eg for faceting, where you want to use int ords for all of your counting, and then only at the end you need to resolve the top N ords to their text. I think this is a useful core functionality, and we should move most of it into Lucene's core. It's a good complement to FieldCache. For this first baby step, I just move it into core and refactor Solr's usage of it. After this, as separate issues, I think there are some things we could explore/improve: * The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this... * We can improve the RAM efficiency of the TermIndex: if the codec supports ords, and we are operating on one segment, we should just use it. 
If not, we can use a more RAM-efficient data structure, eg an FST mapping to the ord. * We may be able to improve on the main byte[] representation by using packed ints instead of delta-vInt? * Eventually we should fold this ability into docvalues, ie we'd write the byte[] image at indexing time, and then loading would be fast, instead of uninverting
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005631#comment-13005631 ] Jason Rutherglen commented on LUCENE-2324: -- bq. I think making a different data structure to hold low-DF terms would actually be a big boost in RAM efficiency. The RAM-per-unique-term is fairly high... However we're not sure why a largish 1+ GB RAM buffer seems to slow down? If we're round robin indexing against the DWPTs I think they'll have a similar number of unique terms as today, even though each DWPT will be smaller in total size, since each contains 1/Nth of the docs. Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, lucene-2324.patch, lucene-2324.patch, test.out, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005651#comment-13005651 ] Jason Rutherglen commented on LUCENE-2324: -- {quote}Ie, if a given term X occurs in 6 DWPTs (today) then we merge-sort the docIDs from the postings of that term, which is costly. (The normal merge that will merge these DWPTs after this issue lands just append by docIDs).{quote} Right, this is the same principal motivation behind implementing DWPTs for use with realtime search, eg, the doc-id interleaving is too expensive to be performed at query time. Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, lucene-2324.patch, lucene-2324.patch, test.out, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005159#comment-13005159 ] Jason Rutherglen commented on LUCENE-2324: -- Is the max optimal DWPT size related to the size of the terms hash, or is it likely something else? Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, lucene-2324.patch, lucene-2324.patch, test.out, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005177#comment-13005177 ] Jason Rutherglen commented on LUCENE-2324: -- {quote}Because 1) the RAM efficiency ought to scale up very well, as you see a given term in more and more docs (hmm, though, maybe not, because from Zipf's law, half your terms will be singletons no matter how many docs you index), and 2) less merging is required.{quote} I'm not sure how we handled concurrency on the terms hash before; however, with DWPTs there won't be contention regardless. It'd be nice if we could build 1-2 GB segments in RAM; I think that would greatly reduce the number of merges that are required downstream. Eg, then there's less need for merging by size, and most merges would be caused by the number/percentage of deletes. If it turns out the low DF terms are causing the slowdown, maybe there is a different hashing system that could be used. Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, LUCENE-2324.patch, lucene-2324.patch, lucene-2324.patch, test.out, test.out, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. 
Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Updated: (LUCENE-2919) IndexSplitter that divides by primary key term
[ https://issues.apache.org/jira/browse/LUCENE-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2919: - Attachment: LUCENE-2919.patch First cut. Roughly divides an index by the exclusive mid term given. IndexSplitter that divides by primary key term -- Key: LUCENE-2919 URL: https://issues.apache.org/jira/browse/LUCENE-2919 Project: Lucene - Java Issue Type: Improvement Affects Versions: 4.0 Reporter: Jason Rutherglen Priority: Minor Attachments: LUCENE-2919.patch Index splitter that divides by primary key term. The contrib MultiPassIndexSplitter we have divides by docid, however to guarantee external constraints it's sometimes necessary to split by a primary key term id. I think this implementation is a fairly trivial change.
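As a small illustration of splitting by primary key rather than by docid, the sketch below partitions a list of keys around a mid term. It is not the attached patch. KeySplitSketch is a hypothetical name, and the reading of "exclusive" used here (keys strictly below the mid term go to the first part, the mid term itself and everything above go to the second) is an assumption, since the comment does not spell it out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical split of primary keys around a mid term (illustration only). */
public class KeySplitSketch {

    /**
     * Returns two lists: keys that sort before midTerm, and keys that are
     * equal to or sort after midTerm. Input order is preserved in each part.
     */
    public static List<List<String>> split(List<String> primaryKeys, String midTerm) {
        List<String> lower = new ArrayList<>();
        List<String> upper = new ArrayList<>();
        for (String key : primaryKeys) {
            if (key.compareTo(midTerm) < 0) {
                lower.add(key);
            } else {
                upper.add(key);
            }
        }
        List<List<String>> parts = new ArrayList<>();
        parts.add(lower);
        parts.add(upper);
        return parts;
    }

    public static void main(String[] args) {
        List<List<String>> parts = split(Arrays.asList("a1", "b7", "m0", "z9"), "m0");
        System.out.println("first part:  " + parts.get(0));
        System.out.println("second part: " + parts.get(1));
    }
}
```

Because membership depends only on the key, a document and any later update to it always land in the same part, which a docid-based split cannot guarantee.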