[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057538#comment-13057538 ]

Michael McCandless commented on LUCENE-2454:

bq. Do you think there are any efficiencies to be gained on the document retrieve side of things if you know that the documents commonly being retrieved are physically nearby

Good question! I think OS level caching should mostly solve this?

Nested Document query support
Key: LUCENE-2454
URL: https://issues.apache.org/jira/browse/LUCENE-2454
Project: Lucene - Java
Issue Type: New Feature
Components: core/search
Affects Versions: 3.0.2
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
Attachments: LUCENE-2454.patch, LUCENE-2454.patch, LuceneNestedDocumentSupport.zip

A facility for querying nested documents in a Lucene index as outlined in http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053142#comment-13053142 ]

Mark Harwood commented on LUCENE-2454:

bq. Could that work for your use case?

Sounds like it, that's great :)

Do you think there are any efficiencies to be gained on the document retrieve side of things if you know that the documents commonly being retrieved are physically nearby? That is, an app will often retrieve a parent's fields and then those from child docs, which are required to be physically located adjacent to the parent's data. Would existing lower-level caching in Directory or the OS mean there's already a good chance of finding child data in cached blocks, or could a change to file structures and/or doc retrieve APIs radically boost parent-plus-child retrieve performance?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053663#comment-13053663 ]

Srinivas Raj commented on LUCENE-2454:

This is exactly what I am looking for; I hope this becomes part of core. How can I make this work with Lucene 3.2? I downloaded the zip file and was able to run the test with Lucene 3.0, but I would like to use the addDocuments() method added in Lucene 3.2. The patches seem to be specific to Lucene 4.0.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052409#comment-13052409 ]

Paul Elschot commented on LUCENE-2454:

This overlaps with the BlockJoinQuery of LUCENE-3171; this issue might even be closed as a duplicate of that one. Which one is preferred?

On using prev/nextSetBit in a safe range: this safe range starts with the parent and ends with the largest known child. A variant of prevSetBit could take this largest known child as an argument to limit its search, and then from the return value one either has a new parent, or one is certain that the current parent is the right one. This would also limit the worst-case number of inspected bits for the group to the group size.

With or without that variant, I think it would be good to add a remark in the javadocs about the possible inefficiency of using OpenBitSet for larger group sizes. When the typical group size gets a lot bigger than the number of bits in a long, another implementation might be faster. This remark in the javadocs would allow us to wait for someone to come along with bigger group sizes and a real performance problem here.
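The bounded prevSetBit variant described above might look roughly like the following. This is only a sketch under stated assumptions: it works on a plain long[] of parent bits rather than on Lucene's OpenBitSet, and the class name BoundedPrevSetBit and the lowerExclusive argument are hypothetical, not taken from any patch attached to this issue.

```java
/** Hypothetical sketch: a backwards bit search that stops at a known lower bound. */
final class BoundedPrevSetBit {

    /**
     * Returns the index of the highest set bit in the range (lowerExclusive, from],
     * or -1 if no bit is set in that range. Assumes from >= 0 and lowerExclusive >= -1.
     * Words below the one containing the bound are never read.
     */
    static int prevSetBit(long[] bits, int from, int lowerExclusive) {
        if (from <= lowerExclusive) {
            return -1; // empty range
        }
        int wordIdx = from >> 6;                 // word that holds 'from'
        int lowWord = (lowerExclusive + 1) >> 6; // last word we are allowed to inspect
        // Clear the bits above 'from' in the starting word.
        long word = bits[wordIdx] & (-1L >>> (63 - (from & 63)));
        while (true) {
            if (wordIdx == lowWord) {
                // Clear the bits at or below the bound in the final word.
                word &= -1L << ((lowerExclusive + 1) & 63);
            }
            if (word != 0) {
                return (wordIdx << 6) + 63 - Long.numberOfLeadingZeros(word);
            }
            if (wordIdx == lowWord) {
                return -1; // nothing set between the bound and 'from'
            }
            word = bits[--wordIdx];
        }
    }

    public static void main(String[] args) {
        long[] parents = new long[2];
        parents[0] = (1L << 0) | (1L << 8); // parent docs 0 and 8
        parents[1] = 1L;                    // parent doc 64
        // Current parent is doc 0 and its largest known child is doc 5.
        System.out.println(prevSetBit(parents, 7, 5));  // no parent in (5, 7]
        System.out.println(prevSetBit(parents, 10, 7)); // a parent in (7, 10]
    }
}
```

A return value of -1 means the current parent still applies; any other value is the new parent. The number of inspected bits is bounded by the distance between the largest known child and the current doc.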
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052436#comment-13052436 ]

Mark Harwood commented on LUCENE-2454:

bq. This overlaps with the BlockJoinQuery of LUCENE-3171, this issue might even be closed as duplicate of that one. Which one is preferred?

We need to look at the likely use cases. 2454 was created to service a use case which I expect to be a very common pattern, and I'm not sure if LUCENE-3171 satisfies this need.

Apps commonly need to return a selection of both matching and non-matching children along with the best parents. Why? It's a very similar rationale to the way that highlighting returns a summary of text - it doesn't just return the matched words, it also returns surrounding text as useful context when displaying results to users. However, some texts can be very large and there's a need to limit what context is brought back.

If we apply this logic to 2454 we can see that for the top parents it is common to also want some non-matching children (e.g. for a resume, return a person's employment history - not just the employments that matched the original search), but it is also necessary to summarize some parents' histories (e.g. the contractor who listed a gazillion positions in his employment history needs summarising). A common pattern is for solutions to ask for the best 11 children for the best parents and display only 10 - that way the app knows that for certain parents there is more data available (i.e. those with 11 matches) and can offer a "more" button to retrieve the extra children for parents of interest.

2454 satisfies this use case as follows:
# Use a NestedDocumentQuery to get the best parents, with the child criteria expressed as a MUST clause.
# Use a PerParentLimitedQuery to get a selection of children per top parent, where each child MUST belong to a top parent (tested using primary key), and use the child criteria again but this time as a SHOULD clause to relevance rank the selection of children returned.

It's worth considering this sort of use case carefully before making any code decisions.
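The "ask for 11, display 10" pattern mentioned above can be sketched as below. This is an illustration only: ChildPage is a hypothetical application-side helper, not a class from the patch, and it assumes the app has already fetched up to displayLimit + 1 children for one parent by whatever query it uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the "fetch one extra child to detect more data" pattern. */
final class ChildPage<T> {
    final List<T> shown;   // at most displayLimit children, in the order fetched
    final boolean hasMore; // true when the extra child was present

    private ChildPage(List<T> shown, boolean hasMore) {
        this.shown = shown;
        this.hasMore = hasMore;
    }

    /**
     * @param fetched      children returned for one parent when the app asked
     *                     for displayLimit + 1 results
     * @param displayLimit number of children to actually display
     */
    static <T> ChildPage<T> of(List<T> fetched, int displayLimit) {
        boolean hasMore = fetched.size() > displayLimit;
        int end = Math.min(displayLimit, fetched.size());
        // Copy so the page does not keep a view onto the caller's list.
        List<T> shown = new ArrayList<>(fetched.subList(0, end));
        return new ChildPage<>(shown, hasMore);
    }
}
```

The app would render page.shown and offer the "more" button only when page.hasMore is true.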
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052459#comment-13052459 ]

Michael McCandless commented on LUCENE-2454:

{quote}
bq. It uses 2 passes if you also want to collect child docs per parent

I tend to work with distributed indexes so it involves a 2 pass op anyway - one to understand best parents across the multiple shards first then the perparentlimitedquery to ensure we only pay the retrieve costs for those parents that make the final cut.
{quote}

The distributed case can still be done in a single pass using LUCENE-3171, i.e. each shard returns the top groups and then they are merged in the front. This should be substantially faster than doing a 2nd pass out to all shards. Also, we now have TopDocs.merge/TopGroups.merge to support this use case.

bq. This overlaps with the BlockJoinQuery of LUCENE-3171, this issue might even be closed as duplicate of that one. Which one is preferred?

I think they are likely dups of one another, and I agree we need to make sure all important use cases are covered.

bq. Apps commonly need to return a selection of both matching and non-matching children along with the best parents.

LUCENE-3171 can do this as well, with the same approach as here, i.e. doing 2 passes with two different child queries.

However, I think for both this issue and for LUCENE-3171, this means each child doc must have the parent's PK indexed against it, right? I.e., for that 2nd query you need some way to return all child docs under any of the top parents, so the child query is "parentID MUST be in XX, YY, ZZ and childDoc SHOULD XYZ".

In fact, we could make this a single-pass capability with LUCENE-3171, without requiring each child doc to index its parent PK, i.e. also pull and sort all other non-matching children under any top parent, because collection within each parent is done when you retrieve the TopGroups. But this can be a later enhancement.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052644#comment-13052644 ]

Michael McCandless commented on LUCENE-2454:

bq. A variant of prevSetBit could take this largest known child as an argument to limit its search,

I think we should not require the app to know the max number of children per parent? (I.e., we should just grow buffers, etc., on demand as we collect.)

I mean, if this information is easily available we could optimize for that case, but for some apps it's a good amount of work to record this and update it, so I don't think it should be a required arg when creating the query/collectors, even though it's tempting ;)
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052648#comment-13052648 ]

Michael McCandless commented on LUCENE-2454:

bq. A common pattern is for solutions to ask for the best 11 children for the best parents and display only 10 - that way the app knows that for certain parents there is more data available (i.e. those with 11 matches) and can offer a more button to retrieve the extra children for parents of interest

With LUCENE-3171, you should be able to just ask for 10 here, and then check if the TopDocs.totalHits is > 10 to decide whether to offer the "more" button.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052696#comment-13052696 ]

Michael McCandless commented on LUCENE-2454:

bq. I think the only thing 3171 may be missing from my original use cases then is that I can use multiple PerParentLimitedQueries in one query to get a limit of children of different types e.g. for each parent resume, max 10 results from employment detail children and max 10 results from education background children.

I think LUCENE-3171 can handle this, or something very similar: the collector tracks all of the BlockJoinQuerys involved in the top query.

So, you'd have 1 BJQ matching employment detail child docs and another matching education bg child docs. The BJC collects the top parent docs, then you can retrieve separate TopGroups for each BJQ. In the end you have a TopGroups for the employment detail child docs and another TopGroups for the education bg child docs.

Could that work for your use case?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052194#comment-13052194 ]

Michael McCandless commented on LUCENE-2454:

bq. Would modules/grouping meanwhile be a better place for this than lucene/contrib/queries?

I think modules/join is the right place? When we factor out Solr's generic join impl it can go there too...

I have some concerns about the current approach here (this is why I opened LUCENE-3171):

* prevSetBit is called for each child doc, which is an O(N^2) cost (N = number of child docs for one parent) I think? Admittedly, typically N is probably small...
* It uses 2 passes if you also want to collect child docs per parent.
* PerParentLimitedQuery is also O(N^2) cost, both on insert of a new child and on popping the child docs per group: I think it should use a PQ to find the lowest child to evict per parent doc?
* I think typically an app will want to collect the top N groups (parent docs and their children), so it's more efficient to gather those top N and only in the end sort each set of children per parent? (This is similar to how the 2nd pass grouping collector works.)
* PerParentLimitedQuery only supports relevance sort w/in each parent.
* You don't get the parent/child structure back from PerParentLimitedQuery (but now we have TopGroups, which is a great match for representing each parent and its children).

If you always only use PerParentLimitedQuery on the top parents from the first pass, e.g. you AND/filter it against those parent docs, then the O(N^2) cost is less severe since it'll have a small constant in front, but since it's a Query I imagine users will use it w/o that filter, which is bad... I think using a TopN Collector is a better match here.
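The PQ suggestion above (evict the lowest-scoring child per parent instead of rescanning all kept children) might be sketched as follows. This is a minimal illustration, assuming relevance-only ordering and using java.util.PriorityQueue rather than Lucene's own priority queue; TopChildren and ScoredChild are hypothetical names. It allocates a new object per kept child, so it does not address the object-reuse concern raised elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: keep the best 'limit' children of one parent with a min-heap. */
final class TopChildren {

    static final class ScoredChild {
        final int docId;
        final float score;

        ScoredChild(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private final int limit;
    // Min-heap on score: the head is always the weakest child currently kept.
    private final PriorityQueue<ScoredChild> heap;

    TopChildren(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        this.limit = limit;
        this.heap = new PriorityQueue<>(limit,
                Comparator.comparingDouble((ScoredChild c) -> c.score));
    }

    /** O(log limit) per child, instead of scanning every kept child. */
    void collect(int docId, float score) {
        if (heap.size() < limit) {
            heap.add(new ScoredChild(docId, score));
        } else if (score > heap.peek().score) {
            heap.poll(); // evict the current weakest child
            heap.add(new ScoredChild(docId, score));
        }
    }

    /** Returns the kept children best-first and empties the heap for the next parent. */
    List<ScoredChild> finishParent() {
        List<ScoredChild> best = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) {
            best.add(heap.poll()); // ascending by score
        }
        Collections.reverse(best);
        return best;
    }
}
```

Sorting happens only once per parent, in finishParent(), which matches the point above about gathering first and sorting at the end.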
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052223#comment-13052223 ]

Mark Harwood commented on LUCENE-2454:

bq. prevSetBit is called for each child doc

You could call nextSetBit on the first child to know the safe range of child docs attributable to the same parent, but you would be taking a gamble that this was worth the call, i.e. that there were many possible children per parent to be tested.

bq. It uses 2 passes if you also want to collect child docs per parent

I tend to work with distributed indexes so it involves a 2 pass op anyway - one to understand best parents across the multiple shards first, then the perparentlimitedquery to ensure we only pay the retrieve costs for those parents that make the final cut.

bq. I think it should use a PQ to find the lowest child to evict per parent doc?

Careful object reuse would need to be factored in to avoid excessive GC - each parent would fill a PQ full of child-match object instances that could/should be reused in assessing the next parent.
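The nextSetBit idea above can be illustrated with a small sketch. It assumes the layout discussed in this thread, where a parent doc is immediately followed by its children, and it uses java.util.BitSet as a stand-in for the parent bit set; SafeChildRange and endOfBlock are hypothetical names.

```java
import java.util.BitSet;

/** Hypothetical sketch: find the doc range that belongs to the same parent. */
final class SafeChildRange {

    /**
     * Returns the exclusive end of the child block that starts at firstChild:
     * the next parent's doc id, or maxDoc if no parent follows.
     */
    static int endOfBlock(BitSet parents, int firstChild, int maxDoc) {
        int nextParent = parents.nextSetBit(firstChild);
        return nextParent < 0 ? maxDoc : nextParent;
    }

    public static void main(String[] args) {
        BitSet parents = new BitSet();
        parents.set(0); // children 1..3
        parents.set(4); // children 5..8
        parents.set(9); // children 10..11
        int end = endOfBlock(parents, 5, 12);
        // Every doc in [5, end) has parent 4, so no prevSetBit call is needed for them.
        System.out.println("children of parent 4 end before doc " + end);
    }
}
```

The gamble mentioned above is visible here: one nextSetBit call is paid per block, which only pays off when several children in the block would otherwise each trigger a prevSetBit call.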
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051662#comment-13051662 ]

Paul Elschot commented on LUCENE-2454:

With these rewrite and createWeight methods TestNestedDocumentQuery passes:

{code}
@Override
public Query rewrite(IndexReader reader) throws IOException {
  Query rewrittenChildQuery = childQuery.rewrite(reader);
  return (rewrittenChildQuery == childQuery) ? this
      : new NestedDocumentQuery(rewrittenChildQuery, parentsFilter, scoreMode);
}

@Override
public Weight createWeight(IndexSearcher searcher) throws IOException {
  return new NestedDocumentQueryWeight(childQuery.createWeight(searcher));
}
{code}

I'll continue adding the use of prevSetBit.

Would modules/grouping meanwhile be a better place for this than lucene/contrib/queries?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051673#comment-13051673 ]

Paul Elschot commented on LUCENE-2454:

The assert on the parent was an IllegalArgumentException in the previous patch. Such an unconditional exception would probably be better than an assert, because when asserts are switched off a mistake in the parent filter would not be detected early.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051484#comment-13051484 ]

Paul Elschot commented on LUCENE-2454:

Tried the current patch here to make use of prevSetBit, but ran into a problem with the query weight that could be related to LUCENE-3208.

When fixing the patch here so that NestedDocumentQuery.java looks like this:

{code}
public Weight createWeight(IndexSearcher searcher) throws IOException {
  return new NestedDocumentQueryWeight(childQuery.createWeight(searcher));
}
{code}

the TestNestedDocumentQuery from the patch here fails with an UnsupportedOperationException. After adding the class name to the place in Query.java that constructs this exception, the test fails with:

UnsupportedOperationException: org.apache.lucene.search.NumericRangeQuery

That probably means that the above fix to the patch is wrong. Any comments on how to continue this?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051486#comment-13051486 ]

Michael McCandless commented on LUCENE-2454:

I suspect the NestedDocumentQuery must impl rewrite, and rewrite the childQuery. I hit this on LUCENE-3171, too.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051495#comment-13051495 ]

Paul Elschot commented on LUCENE-2454:

NestedDocumentQuery already implements rewrite() by returning *this*, just as in 3171.

This is a more complete traceback of the exception:

{noformat}
[junit] java.lang.UnsupportedOperationException: org.apache.lucene.search.NumericRangeQuery
[junit]     at org.apache.lucene.search.Query.createWeight(Query.java:91)
[junit]     at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:177)
[junit]     at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:358)
[junit]     at org.apache.lucene.search.nested.NestedDocumentQuery.createWeight(NestedDocumentQuery.java:65)
[junit]     at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:177)
[junit]     at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:358)
[junit]     at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:676)
[junit]     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:292)
[junit]     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:281)
[junit]     at org.apache.lucene.search.TestNestedDocumentQuery.testSimple(TestNestedDocumentQuery.java:92)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1414)
[junit]     at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1332)
{noformat}

Could BooleanWeight be the offender?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051502#comment-13051502 ]

Paul Elschot commented on LUCENE-2454:

One of the nocommits in the patch is about the use of a Filter for the parent filter. NestedDocumentQuery uses an OpenBitSet from this Filter for next() and advance(), just like a Filter, and also as a parent filter. So how about adding something like this:

{code}
public abstract class ParentFilter {
  public abstract ParentDISI getParentDISI(IndexReader reader);
}

public abstract class ParentDISI extends DocIdSetIterator {
  // to be used only after next() or advance() returned NO_MORE_DOCS
  public abstract int getParent();
}
{code}

together with another constructor for NestedDocumentIterator with a ParentFilter argument?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051611#comment-13051611 ]

Paul Elschot commented on LUCENE-2454:

At Query, the javadocs of both createWeight() and rewrite() start with a word of warning. I'll probably need at least a few days to wrap my head around it, so in case anyone meanwhile can provide more help...
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045314#comment-13045314 ] Paul Elschot commented on LUCENE-2454: That is very nicely readable XML. The problem might occur when a document with an optional term occurs before a document in the same group with a required term. So the second question is the one for which the problem might occur. The score value for Grant's resume should then be higher than the score value for Sean's. Testing only for the set of expected results is not enough for this particular query. The problem might occur in another disguise when requiring both terms, and then the set of expected results is enough to test, but this is not as easily tested because one does not know beforehand the order in which the terms are going to be advance()d. The case with an optional term is simpler to test because the optional term is certain to be advance()d to compute the score value after the required term determines that there is a match (see ReqOptSumScorer.score()), and then to be certain of the correct advance() on the optional term one needs to test the score value.
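To make the point about testing score values concrete, here is a toy model of the resume example from this thread in plain Java. It does not use Lucene at all: the class `ResumeScoringSketch`, the `Job` type and the "one point per matching clause" scoring are assumptions made purely for illustration, not how NestedDocumentQuery or ReqOptSumScorer actually compute scores. It only shows why the two questions return the same set of resumes and differ in score alone.

```java
import java.util.Arrays;
import java.util.List;

final class ResumeScoringSketch {
    /** One child "employment" document (hypothetical stand-in for an indexed doc). */
    static final class Job {
        final String employer;
        final String skills;

        Job(String employer, String skills) {
            this.employer = employer;
            this.skills = skills;
        }

        /** Whitespace-token match, mirroring the WhitespaceAnalyzer in the test script. */
        boolean hasSkill(String term) {
            return Arrays.asList(skills.split(" ")).contains(term);
        }

        boolean atEmployer(String name) {
            return employer.equals(name);
        }
    }

    // Children of the two resumes, in the same order as the test data.
    static final List<Job> GRANT = Arrays.asList(
            new Job("lucid", "java lucene"),
            new Job("somewhere else", "mahout and more mahout"));
    static final List<Job> SEAN = Arrays.asList(
            new Job("foo bar", "java"),
            new Job("some co", "mahout mahout and more mahout"));

    /** Question 1: required and optional clause must hit the SAME child. 0 means no match. */
    static int scoreSameChild(List<Job> jobs) {
        int best = 0;
        for (Job job : jobs) {
            if (job.hasSkill("mahout")) {
                best = Math.max(best, 1 + (job.atEmployer("lucid") ? 1 : 0));
            }
        }
        return best;
    }

    /** Question 2: the clauses may be satisfied by DIFFERENT children of one parent. */
    static int scoreAnyChild(List<Job> jobs) {
        boolean mahout = false;
        boolean lucid = false;
        for (Job job : jobs) {
            mahout |= job.hasSkill("mahout");
            lucid |= job.atEmployer("lucid");
        }
        return mahout ? 1 + (lucid ? 1 : 0) : 0;
    }
}
```

In this model both resumes match both questions, so a test that checks only the result set cannot tell whether the optional clause was ever evaluated against Grant's earlier Lucid child; only the score comparison for the second question can.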
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045319#comment-13045319 ] Paul Elschot commented on LUCENE-2454: Looking at the structure of the BooleanQuery, I would expect this to work correctly. The ParentsFilter on the unfiltered scorer of the required term (mahout) should return the docId of the parent (resume) when the unfiltered scorer is at the document containing the required term.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045334#comment-13045334 ] Mark Harwood commented on LUCENE-2454: bq. Looking at the structure of the BooleanQuery, I would expect this to work correctly. I've found it to be robust so far - you just need to be clear about directing criteria at only one child or potentially different children. The main challenge in using this functionality is allowing users to articulate the nuances of such queries and LUCENE-3133 is a holding place for this. Under the covers, using the same cached filter for parent filters certainly helps with performance, and I typically wrap the ParentFilter tag in the XML queries with a CachedFilter tag to achieve this.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13045561#comment-13045561 ] Paul Elschot commented on LUCENE-2454: So one concern that is left is performance for parent testing. I'll open an issue for OpenBitSet.prevSetBit().
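For readers wondering what a word-at-a-time backwards lookup might look like, the following is a minimal sketch over a raw long[] array in plain Java. It is not the OpenBitSet code and not the implementation that issue produced; the class name `PrevSetBitSketch` and the exact semantics (highest set bit at or before the given index, -1 if none) are assumptions chosen for illustration.

```java
final class PrevSetBitSketch {
    /**
     * Returns the index of the highest set bit at or before {@code index},
     * or -1 if there is none. Bit i lives in words[i / 64] at position i % 64.
     */
    static int prevSetBit(long[] words, int index) {
        if (index < 0 || words.length == 0) {
            return -1;
        }
        int wordIndex = index >> 6;
        if (wordIndex >= words.length) {
            // Clamp to the last bit that the array can represent.
            wordIndex = words.length - 1;
            index = (words.length << 6) - 1;
        }
        // Shift left so that bit `index` becomes the top bit; higher bits fall off.
        long word = words[wordIndex] << (63 - (index & 63));
        if (word != 0) {
            return index - Long.numberOfLeadingZeros(word);
        }
        // Scan earlier words a whole word at a time.
        while (--wordIndex >= 0) {
            word = words[wordIndex];
            if (word != 0) {
                return (wordIndex << 6) + 63 - Long.numberOfLeadingZeros(word);
            }
        }
        return -1;
    }
}
```

The point of the sketch is that empty 64-bit words are skipped with a single comparison, which is what would make a backwards parent lookup cheaper than stepping one docId at a time.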
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044828#comment-13044828 ] Mark Harwood commented on LUCENE-2454: Below are 2 example tests searching employment resumes - both using the same optional and mandatory clauses but in subtly different ways. Question 1 is "who has Mahout skills and preferably used them at Lucid?" while the other question is "who has Mahout skills and preferably has been employed by Lucid?". The questions and the answers are different. Below is the XML test script I used to illustrate the data/queries used, define expected results and run as an executable test. Hopefully you can make sense of this:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="test.xsl"?>
<Test description="NestedQuery tests">
  <Data>
    <Index name="ResumeIndex">
      <Analyzers class="org.apache.lucene.analysis.WhitespaceAnalyzer">
      </Analyzers>
      <Shard name="shard1">
        <!-- === -->
        <Document pk="1">
          <Field name="name">grant</Field>
          <Field name="docType">resume</Field>
        </Document>
        <!-- === -->
        <Document pk="2">
          <Field name="employer">lucid</Field>
          <Field name="docType">employment</Field>
          <Field name="skills">java lucene</Field>
        </Document>
        <!-- === -->
        <Document pk="3">
          <Field name="employer">somewhere else</Field>
          <Field name="docType">employment</Field>
          <Field name="skills">mahout and more mahout</Field>
        </Document>
        <!-- === -->
        <Document pk="4">
          <Field name="name">sean</Field>
          <Field name="docType">resume</Field>
        </Document>
        <!-- === -->
        <Document pk="5">
          <Field name="employer">foo bar</Field>
          <Field name="docType">employment</Field>
          <Field name="skills">java</Field>
        </Document>
        <!-- === -->
        <Document pk="6">
          <Field name="employer">some co</Field>
          <Field name="docType">employment</Field>
          <Field name="skills">mahout mahout and more mahout</Field>
        </Document>
      </Shard>
    </Index>
  </Data>
  <Tests>
    <Test description="Who knows Mahout and preferably used it *while employed at Lucid*?">
      <Query>
        <NestedQuery>
          <!-- testing properties of individual child employment docs -->
          <Query>
            <BooleanQuery>
              <Clause occurs="must">
                <TermsQuery fieldName="skills">mahout</TermsQuery>
              </Clause>
              <Clause occurs="should">
                <TermsQuery fieldName="employer">lucid</TermsQuery>
              </Clause>
            </BooleanQuery>
          </Query>
          <ParentsFilter>
            <TermsFilter fieldName="docType">resume</TermsFilter>
          </ParentsFilter>
        </NestedQuery>
      </Query>
      <ExpectedResults why="Grant's tenure at Lucid is
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044608#comment-13044608 ] Paul Elschot commented on LUCENE-2454: I finally had some time to start taking a look at the grouping module and again at the patch here. There is too much code there for me to come up with a test case soon. So please don't wait for me to commit this. An easy way to test this would be to have a boolean query with a required term and an optional term, with the optional term occurring in a document that comes before (i.e. has a lower docId than) a document in the same group that contains the required term. In case I run into this I'll open a separate issue.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13044326#comment-13044326 ] Michael McCandless commented on LUCENE-2454: OK I opened LUCENE-3171 to explore the single-pass approach.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13039623#comment-13039623 ] Michael McCandless commented on LUCENE-2454:
bq. I'll need to check LUCENE-3129 for equivalence with PerParentLimitQuery. It's certainly a central part of what I typically deploy for nested queries - pass 1 is usually a NestedDocumentQuery to get the best parents and pass 2 uses PerParentLimitQuery to get the best children for these best parents.
Hmm, so I wonder if we could do this in one pass? Ie, like grouping, if you indexed your docs as blocks, you can use the faster single-pass collector; but if you didn't, you can use the more general but slower and more-RAM-consuming two pass collector. It seems like we should be able to do something similar with joins, somehow... ie Solr's join impl is a start at the fully general two-pass solution. But I agree the join child to parent and then grouping of child docs go hand in hand for searching... What do you do for facet counting in these apps...? Post-grouping faceting also ties in here.
bq. Of course some apps can simply fetch ALL children for the top parents but in some cases summarising children is required
Right...
bq. (note: this is potentially a great solution for performance issues on highlighting big docs e.g. entire books).
I think it'd be compelling to index book/articles with each page/section/chapter being a new doc, and then group them under their book/article.
bq. I haven't benchmarked nextSetBit vs the existing rewind implementation but I imagine it may be quicker.
I think it should be much faster -- obs.nextSetBit looks heavily optimized, since it can operate a word at a time. Though, if the groups are smallish, so that nextSetBit is often maybe 2 or 3 bits away, I'm not sure it'd be faster...
bq. Parent-followed-by-children seems more natural from a user's point of view however.
But is it really so bad to ask the app to put parent doc last? I mean, the docs have to be indexed w/ the new doc block APIs in IW anyway, which will often be eg a List<Document>, at which point putting parent last seems a minor imposition? Since this is an expert API I think it's OK to put [minor] impositions on its usage if this can simplify the impl / make it faster / less risky. That said, I'm not yet sure on the impl (single pass query + collector vs generic two-pass join that solr now has), so it's probably premature to worry about this...
bq. I guess you could always keep the parent-then-child insertion order but flip the bitset (then cache) for query execution if that was faster.
True but this adds some hair into the impl (we must also flip coming back from nextSetBit)...
bq. Benchmarking rewind vs nextSetBit vs flip then nextSetBit would reveal all.
True, though it'd be best to do this in the context of the actual join impl...
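The three lookup strategies being compared in this exchange can be sketched with java.util.BitSet standing in for OpenBitSet. This is only an illustration: the class `ParentLookupSketch` and its method names are hypothetical, and reading "flip the bitset" as mirroring docIds (doc d becomes maxDoc - 1 - d) is an assumption based on the remark that the result of nextSetBit must be flipped back.

```java
import java.util.BitSet;

final class ParentLookupSketch {
    /** Parent-last layout: the parent is the first marked doc at or after the child. */
    static int parentWhenLast(BitSet parents, int childDoc) {
        return parents.nextSetBit(childDoc);
    }

    /** Parent-first layout, "rewind": step back one doc at a time until a parent is hit. */
    static int parentWhenFirstByRewind(BitSet parents, int childDoc) {
        int doc = childDoc;
        while (doc >= 0 && !parents.get(doc)) {
            doc--;
        }
        return doc; // -1 if no parent precedes the child
    }

    /** Builds the mirrored bitset once, so it could be cached and reused across queries. */
    static BitSet flip(BitSet parents, int maxDoc) {
        BitSet flipped = new BitSet(maxDoc);
        for (int d = parents.nextSetBit(0); d >= 0; d = parents.nextSetBit(d + 1)) {
            flipped.set(maxDoc - 1 - d);
        }
        return flipped;
    }

    /** Parent-first layout via the mirrored bitset: flip in, nextSetBit, flip back out. */
    static int parentWhenFirstByFlip(BitSet flipped, int maxDoc, int childDoc) {
        int hit = flipped.nextSetBit(maxDoc - 1 - childDoc);
        return hit < 0 ? -1 : maxDoc - 1 - hit;
    }
}
```

The sketch makes the trade-off visible: parent-last needs a single forward lookup, while parent-first needs either a per-doc rewind or two extra index translations per lookup. Which is actually faster would still have to be benchmarked, as the thread says.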
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13039641#comment-13039641 ] Paul Elschot commented on LUCENE-2454: I see no test cases for required terms in a nested document. This may be nontrivial in that advance() should advance into the first doc of the nested doc. For example, assume the parents p1 and p2 are the first docs in the nested docs, and that the query requires a and b to be present:
{noformat}
docId
0      p1
1      a
2      b
3      p2
4      b
5      a
{noformat}
In this situation, p2 may be missed when advance() on a required scorer for b is given docId 5 (containing a) as a target. It should be given target docId 3 to advance into the nested doc p2 containing a. I quickly read the code here, but I could not easily determine whether this is done correctly or not. Shall I add a test case here, or would it be better to open another issue after this one is closed, or can someone reassure me that this is not an issue?
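The scenario above can be reproduced in a small plain-Java model, with sorted int arrays standing in for postings and java.util.BitSet marking the parents. Everything here (the class `GroupAdvanceSketch`, its methods, the linear advance()) is hypothetical and deliberately simplified; it is not the patch's scorer. It only illustrates that the advance target has to be the start of the other scorer's group rather than the other scorer's current docId.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class GroupAdvanceSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** First posting at or after target, or NO_MORE_DOCS. Postings are sorted ascending. */
    static int advance(int[] postings, int target) {
        for (int doc : postings) {
            if (doc >= target) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }

    /**
     * Parents (parent-first layout) whose group contains both required terms.
     * Assumes every child docId is preceded by some parent docId.
     */
    static List<Integer> parentsWithBoth(BitSet parents, int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int docA = advance(a, 0);
        int docB = advance(b, 0);
        while (docA != NO_MORE_DOCS && docB != NO_MORE_DOCS) {
            int parentA = parents.previousSetBit(docA);
            int parentB = parents.previousSetBit(docB);
            if (parentA == parentB) {
                hits.add(parentA);
                int nextGroup = parents.nextSetBit(parentA + 1);
                if (nextGroup < 0) {
                    break; // no further groups
                }
                docA = advance(a, nextGroup);
                docB = advance(b, nextGroup);
            } else if (parentA < parentB) {
                docA = advance(a, parentB); // target is the group start, not docB
            } else {
                docB = advance(b, parentA); // target is the group start, not docA
            }
        }
        return hits;
    }
}
```

With parents at docIds 0 and 3, a = {5} and b = {2, 4}, a doc-level target of 5 for b would return NO_MORE_DOCS and p2 would be missed, whereas the group-start target 3 lands on docId 4 and p2 is reported.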
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038415#comment-13038415 ] Thomas Guttesen commented on LUCENE-2454: Hi. Great feature... I have some difficulties understanding the semantics/flow of document creation. Do you have to add the parent and child levels in a particular sequence? Or can you insert all parents and then insert all child levels later? The reason I ask is that in my case I am looking for a one-to-many relation style of insertion. I had hoped that I could add more child levels later. Cheers
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038460#comment-13038460 ] Mark Harwood commented on LUCENE-2454: Thanks for the patch work, Mike. I'll need to check LUCENE-3129 for equivalence with PerParentLimitQuery. It's certainly a central part of what I typically deploy for nested queries - pass 1 is usually a NestedDocumentQuery to get the best parents and pass 2 uses PerParentLimitQuery to get the best children for these best parents. Of course some apps can simply fetch ALL children for the top parents but in some cases summarising children is required (note: this is potentially a great solution for performance issues on highlighting big docs e.g. entire books). I haven't benchmarked nextSetBit vs the existing rewind implementation but I imagine it may be quicker. Parent-followed-by-children seems more natural from a user's point of view however. I guess you could always keep the parent-then-child insertion order but flip the bitset (then cache) for query execution if that was faster. Benchmarking rewind vs nextSetBit vs flip then nextSetBit would reveal all. Thomas - maintaining a strict order of parent/child docs is important and the recently-committed LUCENE-3112 should help with this.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034702#comment-13034702 ] Michael McCandless commented on LUCENE-2454: I think this is a very important addition to Lucene, so let's get this done! I just opened LUCENE-3112, to add IW.add/updateDocuments, which would atomically add the Documents produced by an iterator, and ensure they all wind up in the same segment. I think this is the only core change necessary for this feature? Ie, all else can be built on top of Lucene once LUCENE-3112 is committed?
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034726#comment-13034726 ] Mark Harwood commented on LUCENE-2454: bq. I think this is the only core change necessary for this feature? Yup. A same-segment indexing guarantee is all that is required.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012501#comment-13012501 ] RynekMedyczny.pl commented on LUCENE-2454: {quote} Code like this ends up in trunk when there is consensus so your support is welcome. {quote} Of course! How can we help you? {quote} While core Lucene adoption is a relatively simple technical task {quote} We are eagerly waiting for your work to be incorporated into Lucene Core!
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13010110#comment-13010110 ] Mark Harwood commented on LUCENE-2454: bq. I have not looked at this patch so this comment may be off base. The slideshare deck gives a good overview: http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene As this is a simple Lucene-focused addition, I'd prefer not to explore all the possible implications for Solr adoption here. The affected areas in Solr are extensive and would include schema definitions, query syntax, facets/filter caching, result-fetching, DIH etc. Probably best discussed elsewhere.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009985#comment-13009985 ] Ryan McKinley commented on LUCENE-2454: bq. Solr, however does introduce a schema and much more that assumes a flat model. In SOLR-1566 we could add a DocList as a field within a SolrDocument -- this would at least allow the output format to return a nested structure. I have not looked at this patch so this comment may be off base.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009071#comment-13009071 ] RynekMedyczny.pl commented on LUCENE-2454: Mark, do you have any plans for including this feature into the Lucene trunk? I think that this is a must have feature since tree structures are so common! Thank you in advance.
Re: [jira] [Commented] (LUCENE-2454) Nested Document query support
On 3/21/11 10:51 AM, Dawid Weiss wrote:
> Is it just me, or was that last e-mail sent with the header: From: RynekMedyczny.pl (JIRA) j...@apache.org
JIRA comment notifications put username in front of JIRA's own address. Apparently someone uses RynekMedyczny.pl as their username.
> This is weird :)
I, for one, welcome RynekMedyczny.pl as a Solr user :)
-- Best regards, Andrzej Bialecki, http://www.sigram.com
Re: [jira] [Commented] (LUCENE-2454) Nested Document query support
Oh, in this case I also welcome RynekMedyczny.pl as a Solr user ;) Dawid P.S. RynekMedyczny ~= HealthCareMarket
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009111#comment-13009111 ] Mark Harwood commented on LUCENE-2454: bq. Mark, do you have any plans for including this feature into the Lucene trunk? That is my intention in providing it here. I had to work hard to convince my employer to let me release this as open source in the interests of seeing it updated/tested as core Lucene APIs change - and hopefully receive some improved support in IndexWriter flush control. Unfortunately it seems not everyone shares the pain when it comes to modelling richer data structures, and some seem content with the flat model we have in Lucene today. Code like this ends up in trunk when there is consensus so your support is welcome. While core Lucene adoption is a relatively simple technical task, Solr adoption feels like a much more disruptive change.
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009141#comment-13009141 ] Jamal Natour commented on LUCENE-2454: Mark, For my project this is a must have feature that could decide the adoption of SOLR. What do you think is the best way to help ensure this gets incorporated into SOLR? Thank you, Jamal
[jira] [Commented] (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13009163#comment-13009163 ] Mark Harwood commented on LUCENE-2454:
Lucene does not dictate a schema, so using this approach to index design/querying is not a problem with base Lucene. Solr, however, does introduce a schema and much more that assumes a flat model. In the opening chapters of the Solr 1.4 Enterprise Search Server book the authors take the time to discuss the modelling limitations of this flat model and acknowledge this as an issue. The impact of adopting nested documents in Solr at this stage would be very large. There may be ways you can overcome some of your issues without requiring nested documents (using phrase/span queries or combining tokens from multiple fields in Solr) but in my experience these are often poor alternatives if richer structures are important. Cheers Mark
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000239#comment-13000239 ] Mark Harwood commented on LUCENE-2454:
Hi Paul, I'm not sure I currently have an issue with merges as they just concatenate established segments without interleaving their documents. This operation should retain the order that is crucial to maintaining the parent/child/grandchild relationships (unless something has changed in merge logic which would certainly be an issue!). My main cause for concern is robust control over flushes so parent/child docs don't end up being separated into different segments at the point of arbitrary flushes.
I think your proposal here is related to a new (to me) use case where clients can add a single new child document and the index automagically reorganises to assemble all prior related documents back into a structure where they are grouped as contiguous documents held in the same segment? Please correct me if I am wrong. Previously I have always seen this need for reorganisation as an application's responsibility: a single child document addition required the app to delete the associated parent and all old child docs, then add a new batch of documents representing the parent, old children plus the new child addition. Given the implied deletes and inserts required to maintain relationship integrity, that seems like an operation that needs to be done under the control of Lucene's transaction management APIs rather than some form of special MergePolicy, which is really intended for background efficiency tidy-ups, not integrity maintenance.
As for the fields you outline for merging, generally speaking in applications using NestedDocumentQuery and PerParentLimitedQuery I have found that for searching purposes I already need to store:
1) A globally unique ID as an indexed primary key field on the top-level container document
2) An indexed field with the same unique ID held in a different foreign key field on child documents
3) An indexed field indicating the document type e.g. root or resume and level1Child or employmentRecord
I could be a little confused about your intentions - maybe we should start with what problem we are trying to solve before addressing how we achieve it? Cheers Mark
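The application-side update described above (delete the parent and all of its old children, then re-add the whole batch contiguously) can be sketched in plain Java without any Lucene dependency. This is only an illustration of the bookkeeping: the class NestedBlockStore, its methods, and the field names id, parentId and docType are hypothetical stand-ins for the three indexed fields listed above, and parent-first ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index: an ordered list of flat documents.
// All names here are hypothetical; this is not Lucene API.
final class NestedBlockStore {
    static final class Doc {
        final String docType;   // e.g. "resume" or "employmentRecord"
        final String id;        // primary key, set on the parent only
        final String parentId;  // foreign key, set on children only

        Doc(String docType, String id, String parentId) {
            this.docType = docType;
            this.id = id;
            this.parentId = parentId;
        }
    }

    final List<Doc> docs = new ArrayList<>();

    // Adds a parent followed immediately by its children, so the block stays contiguous.
    void addBlock(Doc parent, List<Doc> children) {
        docs.add(parent);
        docs.addAll(children);
    }

    // Removes the old parent and every child carrying its key, then re-adds the new batch.
    void replaceBlock(String key, Doc parent, List<Doc> children) {
        docs.removeIf(d -> key.equals(d.id) || key.equals(d.parentId));
        addBlock(parent, children);
    }
}
```

In a real index the delete and the re-add would need to be committed together, which is the transactional concern raised in the comment.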
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000444#comment-13000444 ] Mark Harwood commented on LUCENE-2454:
bq. The intention is quite simple: allow a set of documents to be used to provide a single score value during query searching
That's what the existing NestedDocumentQuery code attached to this issue already provides. As far as I am concerned the search side works fine and I have it installed in several live installations (along with a bug fix for skip that I must remember to upload here). Parent filters as you suggest benefit from caching and I typically use the XMLQueryParser with a CachedFilter tag to take care of that (I need to upload the XMLQueryParser extensions for this Nested stuff too).
The new intention that I think you added in your last post was more complex and is related to indexing, not searching, and introduced the idea that adding a new child doc on its own should somehow trigger some automated repair of the index contents. This repair would involve ensuring that related documents from previous adds would be reorganised such that all related documents still remained physically next to each other in the same segment. I don't think a custom MergePolicy is the class to perform this operation - merge policies are simply consulted as advisors to pick which segments are ripe for a background merge operation conducted elsewhere. The more complex merge task you need performed here requires selective deletes of related docs from existing segments and addition of the same documents back into a new segment. This is a task I have always considered something the application code should do rather than relying on Lucene to second-guess what index reorganisation may be required.
We could try to make core Lucene understand and support parent/child relationships more fully but I'd settle for this existing approach with some added app-control over flushing as a first step.
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000503#comment-13000503 ] Paul Elschot commented on LUCENE-2454:
So the missing basic operation is a copy/append of a range of existing index docs. After that operation, the original docs can be deleted, but that is trivial. I'll have a look at IndexWriter for this over the coming days. Any quick hints?
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000541#comment-13000541 ] Mark Harwood commented on LUCENE-2454:
I'm not sure the auto-repair is that trivial. Let's say the parent/child docs are resumes and nested docs for employment positions (as in the attached example). An update may not just be adding a new employment position doc but editing an existing one, deleting an old one etc. Your auto-updater is going to need to do a lot of figuring out to work out which existing docs need copying over from earlier segments and patching in to the new segment with the updated parts of the resume. This gets worse if we start to consider multiple levels to the hierarchy. It all feels like a lot of work for the IndexWriter to take on?
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1300#comment-1300 ] Paul Elschot commented on LUCENE-2454:
How about an implementation for strict hierarchies that uses two fields per document, in the following way: The two fields each contain a single (indexed) token that indicates the node in the nesting hierarchy, one field meaning that the document is a child of that node, and the other that the document is the representative of that node. Any number of levels could be allowed, but no cycles of course. These fields are then used by a merge policy to keep the documents ordered in postorder, that is, the children immediately followed by the representative for each node. Collecting scores at any node in the hierarchy could then be done by using term filters, one for each involved scorer, to provide the representative for the current doc by advancing. For example, in index order:
userDocId  nodeMemberField  nodeReprField
doc1       nodeA1           .
doc2       nodeA1           .
doc3       nodeA            nodeA1
doc4       nodeA2           .
doc5       nodeA2           .
doc6       nodeA            nodeA2
The node representatives for scoring could then be obtained by a term filter for nodeA. I think this could work for the scoring part, basically along the lines of the code already posted here. Could someone with more experience in segment merge policies comment on this? This is quite restrictive for merging as the only freedom that is left in the document order is the order of the children for each node. For example, adding a leaf document doc7 for nodeA1 could result in the following index order:
doc4       nodeA2           .
doc5       nodeA2           .
doc6       nodeA            nodeA2
doc7       nodeA1           .
doc1       nodeA1           .
doc2       nodeA1           .
doc3       nodeA            nodeA1
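The postorder layout proposed above can be illustrated with a small plain-Java sketch: for a given document, the representative is found by advancing forward to the first later document whose nodeReprField equals the document's nodeMemberField. The class PostorderNodes and its method are hypothetical and only model the lookup; a real implementation would advance a term filter or scorer rather than scan a list.

```java
import java.util.List;

// Hypothetical model of the two-field postorder scheme; not Lucene API.
final class PostorderNodes {
    static final class Doc {
        final String userDocId;
        final String nodeMember; // node this doc is a child of
        final String nodeRepr;   // node this doc represents, or null for "."

        Doc(String userDocId, String nodeMember, String nodeRepr) {
            this.userDocId = userDocId;
            this.nodeMember = nodeMember;
            this.nodeRepr = nodeRepr;
        }
    }

    // Returns the index of the representative of docs.get(i): the first later doc
    // whose nodeRepr matches this doc's nodeMember, or -1 if there is none.
    static int representativeOf(List<Doc> docs, int i) {
        String node = docs.get(i).nodeMember;
        for (int j = i + 1; j < docs.size(); j++) {
            if (node.equals(docs.get(j).nodeRepr)) {
                return j;
            }
        }
        return -1;
    }
}
```

With the first table above, doc1 and doc2 both resolve to doc3, and doc4 and doc5 resolve to doc6; doc3 and doc6 have no representative because the example contains no document representing nodeA.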
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889088#action_12889088 ] Michael McCandless commented on LUCENE-2454:
Maybe we should add an addDocuments call to IW? To add more than one document, atomically, so that any flush must happen before or after them?
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889104#action_12889104 ] Mark Harwood commented on LUCENE-2454:
bq. Maybe we should add an addDocuments call to IW? To add more than one document, atomically, so that any flush must happen before or after them?
That would be nice. Another way of modelling this would be to introduce Document.add(Document childDoc) but I think that is a more fundamental and wide-reaching change.
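The behaviour asked for here, a multi-document add during which no flush may occur, can be sketched as a toy buffering writer. This is a minimal illustration under stated assumptions, not the IndexWriter implementation: BlockBufferingWriter, its fields, and the use of plain strings as documents are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer that only considers flushing between blocks, never inside one.
final class BlockBufferingWriter {
    private final int maxBufferedDocs;
    private List<String> buffer = new ArrayList<>();
    final List<List<String>> segments = new ArrayList<>();

    BlockBufferingWriter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Adds the whole block first, then checks the flush threshold,
    // so a parent and its children always land in the same segment.
    void addDocuments(List<String> block) {
        buffer.addAll(block);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    // Turns the current buffer into a new segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(buffer);
            buffer = new ArrayList<>();
        }
    }
}
```

The trade-off is that a segment may exceed the configured threshold by up to one block, which seems acceptable when blocks are small relative to the buffer.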
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889215#action_12889215 ] Buddika Gajapala commented on LUCENE-2454:
Mark, that was fast :) BTW another scenario: when there is a lot of data, there is a possibility of having a parent document and a matching child document in two different segments, causing some matches to be missed. I made a minor modification to your approach by making it do a forward scan instead of a reverse scan for parent docs and having the parent document inserted AFTER the child docs are inserted. In case the parent doc is located outside the scope of the current doc, its docid is preserved at the Weight object level, and nextDoc() is modified to check for that on the very first nextDoc() call to a new segment.
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1296#action_1296 ] Buddika Gajapala commented on LUCENE-2454:
I tried this solution and it works perfectly for smaller indexes (where either the number of documents is low or the documents are small). However, for larger indexes that span multiple segments it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index while the actual traversal for matching criteria done by the Scorer is segment-by-segment (i.e. in the nextDoc() and advance() methods). Have you considered this situation?
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888908#action_12888908 ] Mark Harwood commented on LUCENE-2454:
The 2nd comment above talks about this and the need for Lucene to offer more control over flush policy.
bq. it only matches the parent document accurately for the 1st segment. I think this is due to the way the parent docs are marked using a bit array for the ENTIRE index
But aren't filters held and evaluated within the context of each sub reader? Are you sure the issue isn't limited to a parent/child combo that is split across segments?
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882899#action_12882899 ] Mark Harwood commented on LUCENE-2454:
bq. Can this help in searching over multiple child/nested documents?
Yes, a typical use case is to use NestedDocumentQuery to fetch the top 10 parents then do a second query to fetch the children using a mandatory clause which lists the primary keys of the selected parents (assuming the children have an indexed field with the parent primary key). The PerParentLimitedQuery can be used to limit the number of child docs returned per parent if there are many e.g. pages in a book. Both these classes are in the zipped attachment to this issue.
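The effect of limiting child hits per parent can be illustrated in plain Java. This is not how PerParentLimitedQuery in the attachment is implemented; it is only a sketch of the outcome, applied to a list of child hits that is assumed to be already sorted by score. The class PerParentLimit and the ChildHit type are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing step that caps the number of child hits per parent.
final class PerParentLimit {
    static final class ChildHit {
        final String parentKey; // value of the child's foreign key field
        final String childId;

        ChildHit(String parentKey, String childId) {
            this.parentKey = parentKey;
            this.childId = childId;
        }
    }

    // Keeps at most maxPerParent hits for each parent key, preserving input order.
    static List<ChildHit> limit(List<ChildHit> hits, int maxPerParent) {
        Map<String, Integer> seen = new HashMap<>();
        List<ChildHit> kept = new ArrayList<>();
        for (ChildHit hit : hits) {
            int count = seen.merge(hit.parentKey, 1, Integer::sum);
            if (count <= maxPerParent) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

For the book example, a limit of 2 would return at most two matching pages for each selected book.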
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878617#action_12878617 ] Mark Harwood commented on LUCENE-2454:
Yep, I can see an app with a thousand cached filters would have a problem with this impl as it stands. Maintaining parallel indexes always feels a little flaky to me, not least because of the loss of the transactional integrity you get from using a single index. Is another approach to make your cached filters document-type-specific? I.e. they only hold numbers in the range of zero to number-of-docs-of-this-type. To use a cached doc ID in such a filter you would need to make use of mapping arrays to project the type-specific doc id numbers into global doc-id references and back. Let's imagine an index with a mix of A, B and C doc types organised as follows:
docId  docType
=====  =======
1      A
2      B
3      C
4      A
5      C
6      C
The mapping arrays for docType C would look as follows:
{code:title=Bar.java|borderStyle=solid}
// sparse, one entry per doc in the overall index (array position = docId - 1, -1 = not a type C doc)
int[] globalDocIdToTypeCLookUp = {-1, -1, 0, -1, 1, 2};
// dense, one entry per type C doc (array positions of docs 3, 5 and 6)
int[] typeCToGlobalDocIdLookUp = {2, 4, 5};
{code}
Your cached filters would be created as follows:
{code:title=Bar.java|borderStyle=solid}
OpenBitSet myTypeCBitSet = new OpenBitSet(numberOfTypeCDocs); // this line is hopefully where you save RAM!
// for all matching type C docs...
myTypeCBitSet.set(globalDocIdToTypeCLookUp[realDocId]);
{code}
Your filters can then be used by dereferencing the child doc IDs as follows:
{code:title=Bar.java|borderStyle=solid}
int nextRealDocId = typeCToGlobalDocIdLookUp[myTypeCBitSet.nextSetBit(0)];
{code}
Clearly the mapping arrays come at a cost of 4 bytes * num docs which is non-trivial. The sparse globalDocIdToTypeCLookUp array shown here could be avoided by reading TermDocs and counting at cached-Filter-create time.
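The mapping-array idea can be tried out end to end with a self-contained sketch that uses java.util.BitSet in place of Lucene's OpenBitSet. The class TypeSpecificDocIds and its methods are hypothetical, doc types are modelled as single characters, and all IDs are zero-based array positions.

```java
import java.util.BitSet;

// Hypothetical helpers for projecting between global and type-specific doc IDs.
final class TypeSpecificDocIds {
    // Sparse lookup: global position -> type-specific ID, or -1 for docs of another type.
    static int[] globalToType(char[] docTypes, char type) {
        int[] lookup = new int[docTypes.length];
        int next = 0;
        for (int i = 0; i < docTypes.length; i++) {
            lookup[i] = (docTypes[i] == type) ? next++ : -1;
        }
        return lookup;
    }

    // Dense lookup: type-specific ID -> global position.
    static int[] typeToGlobal(char[] docTypes, char type) {
        int count = 0;
        for (char t : docTypes) {
            if (t == type) count++;
        }
        int[] lookup = new int[count];
        int next = 0;
        for (int i = 0; i < docTypes.length; i++) {
            if (docTypes[i] == type) lookup[next++] = i;
        }
        return lookup;
    }

    // Next matching global position at or after the given type-specific ID, or -1 if none.
    static int nextGlobalDocId(BitSet typeBits, int fromTypeId, int[] typeToGlobal) {
        int typeId = typeBits.nextSetBit(fromTypeId);
        return typeId < 0 ? -1 : typeToGlobal[typeId];
    }
}
```

For the six-document example, a filter over type C needs only 3 bits instead of 6; the saving grows with the share of documents that belong to other types.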
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878741#action_12878741 ] David Smiley commented on LUCENE-2454:
That's an interesting strategy. The size of these arrays is no big deal to me since there's only a couple of them. My concern with this strategy is that potentially many places in Solr would have to become aware of this scheme, which might make it untenable to implement even though it's theoretically sound. Another nice thing about the parallel index is that the idf relevancy factor stays clean since it will only consider real documents. I want to investigate these options closer ASAP since this feature you've implemented is something I need. Before I saw this issue, I was going to try something with SpanNearQuery and the masking-field variant.
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878434#action_12878434 ] Mark Harwood commented on LUCENE-2454:
bq. Wow, this is absolutely awesome!
Thanks. I've found that this certainly solves problems I previously couldn't address at all in standard Lucene.
bq. The leading concern I have with this implementation is the size of the number of documents in the index as it affects the size of filters
These filters can obviously be cached but you'll need one filter per level you roll up to. Assuming a 300m doc index and only rolling up matches to the root that should only cost 300m / 8 bits per byte = 37.5 meg of RAM. Index reloads should avoid the cost of completely rebuilding this filter nowadays because filters are cached at segment level and unchanged segments will retain their cached filters. Perhaps a bigger concern is any norms arrays which are allocated one BYTE (as opposed to one bit) per document in the index.
bq. and they don't share any fields with the parent.
For parents with only 1 child document instance of a given type, these could be safely rolled up into the parent and stored in the same document.
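The memory arithmetic above can be reproduced with a couple of helper methods. The class FilterMemory is hypothetical, and the fieldsWithNorms parameter is an assumption added here to make explicit that the one-byte-per-document cost applies to each norms array.

```java
// Hypothetical helpers for the back-of-envelope RAM estimates discussed above.
final class FilterMemory {
    // One bit per document, rounded up to whole bytes.
    static long bitSetBytes(long numDocs) {
        return (numDocs + 7) / 8;
    }

    // One byte per document for each norms array (fieldsWithNorms is an assumption).
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }
}
```

For 300 million documents this gives 37,500,000 bytes (37.5 MB) per cached filter and 300,000,000 bytes (300 MB) per norms array, which is why the norms are the larger concern.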
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878317#action_12878317 ] David Smiley commented on LUCENE-2454:
Wow, this is absolutely awesome! This is one of the best enhancement requests to Lucene/Solr that I've seen as it brings a real enhancement that is difficult / impossible to do without. The leading concern I have with this implementation is the size of the number of documents in the index as it affects the size of filters and perhaps other areas involving creating BitSets. I have a scenario in which the sub-documents number on average over 100 to each primary document. These sub-documents are at least very small, and they don't share any fields with the parent. For a large scale search situation, an index containing 3M Lucene documents now needs to store over 300M, and thus requires 100x the amount of RAM for filter caches as I require now. Thoughts?
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866128#action_12866128 ] Mark Harwood commented on LUCENE-2454:
Robust use of this feature is dependent on careful management of segments, i.e. that all the documents of a compound document are held in the same segment. Michael Busch suggested the introduction of a new FlushPolicy on IndexWriter to offer the required control (see http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3c4be5a14c.6040...@gmail.com%3e ). Sounds sensible to me given that IndexWriter currently manages to muddle 2 alternative policies in the one implementation and it looks like we now need a third. Is this the place to start the debate on FlushPolicy? My guess is this change would involve:
* Deprecating/removing IndexWriter's setMaxBufferedDocs and setRAMBufferSizeMB.
* Providing a new FlushPolicy abstract class that is called with a BufferContext class to hold the number of buffered docs + RAM usage. FlushPolicy is asked if flushing of various structures should be triggered given the context.
* Providing default implementations of FlushPolicy that are number-of-documents-based and RAM-based.
* Providing a special NestedDocumentFlushPolicy that can wrap any other policy (RAM/num docs) but only triggers flushes when application code has primed it to say a batch of related documents is completed.
Let me know where it's best to continue the thinking on these IndexWriter changes.
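The FlushPolicy proposal in the list above can be sketched as follows. The class names FlushPolicy, BufferContext and NestedDocumentFlushPolicy come from the comment; every method name, field, and the two default policies (MaxDocsFlushPolicy, RamFlushPolicy) are assumptions made for illustration and do not reflect any API that existed in IndexWriter at the time.

```java
// Snapshot of the writer's buffer state handed to a policy (fields are assumptions).
final class BufferContext {
    final int bufferedDocs;
    final double ramUsedMB;

    BufferContext(int bufferedDocs, double ramUsedMB) {
        this.bufferedDocs = bufferedDocs;
        this.ramUsedMB = ramUsedMB;
    }
}

// Asked by the writer whether a flush should be triggered for the given context.
abstract class FlushPolicy {
    abstract boolean shouldFlush(BufferContext ctx);
}

// Default policy based on the number of buffered documents.
final class MaxDocsFlushPolicy extends FlushPolicy {
    private final int maxBufferedDocs;
    MaxDocsFlushPolicy(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }
    @Override boolean shouldFlush(BufferContext ctx) { return ctx.bufferedDocs >= maxBufferedDocs; }
}

// Default policy based on RAM usage.
final class RamFlushPolicy extends FlushPolicy {
    private final double maxRamMB;
    RamFlushPolicy(double maxRamMB) { this.maxRamMB = maxRamMB; }
    @Override boolean shouldFlush(BufferContext ctx) { return ctx.ramUsedMB >= maxRamMB; }
}

// Wraps another policy but vetoes flushes while a batch of related docs is open.
final class NestedDocumentFlushPolicy extends FlushPolicy {
    private final FlushPolicy delegate;
    private boolean batchOpen;

    NestedDocumentFlushPolicy(FlushPolicy delegate) { this.delegate = delegate; }

    void startBatch() { batchOpen = true; }
    void endBatch() { batchOpen = false; }

    @Override boolean shouldFlush(BufferContext ctx) {
        return !batchOpen && delegate.shouldFlush(ctx);
    }
}
```

Application code would call startBatch() before adding a parent and its children and endBatch() afterwards, so the wrapped policy can only fire on batch boundaries.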
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866134#action_12866134 ] Earwin Burrfoot commented on LUCENE-2454:
Both things can be combined for sure. New stream-like indexing API stuffs docs into IW and controls when flushes /can/ happen, while FlushPolicy decides if they actually /do/ happen, when they /can/.
[jira] Commented: (LUCENE-2454) Nested Document query support
[ https://issues.apache.org/jira/browse/LUCENE-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866148#action_12866148 ] Mark Harwood commented on LUCENE-2454:
bq. - there was a discussion on narrowing indexing API to something stream-like
Any idea where that discussion was taking place? Happy to move flush-control discussions elsewhere if that is more appropriate.