Re: revisit naming for grouping/join?
On Tue, Jul 5, 2011 at 5:44 PM, Mike Sokolov soko...@ifactory.com wrote:

> : Maybe modules/nested? modules/nesteddocs?
>
> modules/subdocs
> modules/nesteddocs
> modules/nested
>
> None of them scream "this is the perfect name" to me, but none of them scream "dear lord this is a terrible idea" either. Instinct says "All other factors being equal, pick the shortest name".
>
> : Hmm... sub feels like it undersells, ie emphasizes "under" or
> : "inferior to" and de-emphasizes the strong cooperation w/ the parent.
>
> How about modules/superdoc? It wouldn't undersell, at least :)

I agree it's no longer underselling :) But I like this even less than sub!

First, I think it has the same problem that sub has, since it's just the mirror image: it's too unequal, ie it implies one side is superior to and above the other side, when in fact joins (XML search, product SKUs, nested docs, etc.) are really symmetric. The nested parts of the doc are just as valid a part of the document as the non-nested part.

Second, I don't like the super-ness of super (ie, in the sense of supercalifragilisticexpialidocious or superman or superwoman) -- it's too generic, ie, like "best" or "awesome".

Mike McCandless
http://blog.mikemccandless.com

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: revisit naming for grouping/join?
On 07/06/2011 08:47 AM, Michael McCandless wrote:

>> How about modules/superdoc? It wouldn't undersell, at least :)
>
> I agree it's no longer underselling :) But I like this even less than sub!
>
> First, I think it has the same problem that sub has, since it's just the mirror image: it's too unequal, ie it implies one side is superior to and above the other side,

I basically agree, although I think there is an asymmetry in that this is a many-one relation? The main improvement this name makes is the removal of the plural in the other options (doc vs docs). And it's shorter than huperduperdoc :)

But otoh nothing I've seen here really captures all that much about index-time vs query-time join, which seems to be the main distinction (why you can't just call it join)? If you're still in the market for names, here are a few: StructureJoin, IntrinsicJoin, TreeJoin; Branch? Just brainstorming loosely. Frankly, Nest* seems good enough.

-Sokolov
Re: revisit naming for grouping/join?
: Also... I think we are over-thinking the name ;) We can't convey
: *everything* in this name; as long as the name makes it clear that
: you'll want to consider this / read its javadocs whenever doing
: something with nested docs, I think that's sufficient. I think
: NestedQueryWrapper (maybe NestedDocsQuery) and NestedDocsCollector are
: good enough, at least better than the functional-driven names they now
: have...

Yeah, that's fair ... i'm not in love with NestedDocsQuery and NestedDocsCollector but i agree they are better than what we have now.

: Honestly at this point I'm tempted to just stick with what we have
: (the functionally driven names, instead of the dominant use case
: driven name).
:
: At its heart, this query is performing a join (well, finishing the
: join that was done during indexing), and despite our efforts to more
: descriptively capture the dominant use case, I don't think we're
: succeeding. We are basically struggling to find ways to explain what
: a join does, into these class names.

I really think it's a bad idea to use "Join" in the name ... i understand that to you this is a join, but as you say it's really just finishing a join that was already done at index time -- for most users "join" is going to have the connotation of a SQL join where you don't have to normalize the data in advance (ie: build the index with all the docs you want to join in a block) and we shouldn't use it unless we are talking about a truly generic query time join -- particularly if we are going to use examples in the doc that seem like the kind of thing you would do with a query time join in SQL.

i know you feel like "nested" (or "subdocs" or "parent") undersells the *possible* usecases of this feature, but the thing to remember is that even in the use cases where the real life data isn't something you might think of as being organized in a nested or hierarchical model, in order to use this feature the user must map their source data model to a Lucene Document model that *does* capture a hierarchy relationship so they can index their data in the appropriate way.

X and Y may not be in a hierarchy, but if you want to join them like this, then the Document for X and the Document for Y must be thought of as being in a hierarchy and indexed in lock step with each other.

"Block" just doesn't feel like it really conveys this ... but anything along the Nested, Parent, Subdoc line of terminology would at least give some point of reference to the idea that the *Document* model in Lucene needs to be organized in this way -- and i think it's really important that the name make that clear.

-Hoss
RE: revisit naming for grouping/join?
From my external POV on this debate, it seems as though the main point of contention is naming the nature of the relation between documents. Instead of doing that, a name that says that there is some form of relation, but leaves open its nature, might work: something like docrelation? (Avoiding the "related documents" IR concept would be important here.)

Steve

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Wednesday, July 06, 2011 2:59 PM
To: dev@lucene.apache.org
Subject: Re: revisit naming for grouping/join?
Re: revisit naming for grouping/join?
On Mon, Jul 4, 2011 at 3:38 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> : Maybe modules/nesteddocuments (I think that's more descriptive than
> : subdocuments)?
>
> either way ... subdocuments has the advantage of being a shorter directory name.

Yeah both are rather long... Maybe modules/nested? modules/nesteddocs?

> i kinda wonder about first impressions and the etymology of "nested" ... it makes me think of bird nests and Russian dolls, neither of which really convey the point: nesting in birds is about protecting/incubating and is only a single layer; while Russian nesting dolls are singular wrappers around wrappers around wrappers. subdocuments seems like it might be better because it conveys more of a hierarchical nature (to me anyway).

Hmm... sub feels like it undersells, ie emphasizes "under" or "inferior to" and de-emphasizes the strong cooperation w/ the parent.

Also, the nesting is not just one level -- it can support an arbitrary star join. So you can join from main to table1 and then from table1 to table2 (parent, child, grandchild). You can also join to multiple child tables from the main table.

I think nested/nesting has strong enough meaning among programmers that most will understand what it means in this context.

> : How about NestedDocumentQuery? And NestedDocumentCollector?
> :
> : See, you can use NestedDocumentQuery but collect it with any ordinary
> : collector if you don't care about the nesting (ie, you are only
> : interested in matches in the parent document space). The
> : NestedDocumentCollector also collects all the nested docs matching
> : each parent hit.
>
> Hmmm... My suggestion of ParentDocumentQuery was based on the understanding that the simplest usecase was...
>
>   Query inner = getSomethingThatMatchesSomeChildDocs();
>   Filter parents = someFilterThatMatcheAllKnownParentDocs();
>   Query outer = new ParentDocumentQuery(inner, parents);
>   TopDocs results = searcher.search(outer);
>
> ...and in this case results will contain the parents of the child documents that match inner. is that correct?

Correct.

> if so, then independent of the Collector, ParentDocumentQuery (or ParentDocumentQueryWrapper) still seems like it makes the most sense.

Hmm, but that doesn't convey that it handles this nesting, ie, that it's joining child docs with parent docs. Also, these queries can be nested (from the 2nd join in the star join), and so it could be ChildAndGrandChildrenQuery.

I guess Wrapper would make sense since it wraps a query matching the nested docs. I think Document is redundant/implied? Maybe NestedQueryWrapper?

> For the Collector, i realize now that i totally misunderstood its API -- for some reason i thought it would wrap another Collector and proxy to the inner collector only the parents, independently collecting/recording the groups of parent-children info which could be asked for later. ChildDocumentsCollector definitely doesn't make sense -- it's not just collecting children, it's collecting Groups made up of parents and children ... GroupCollector is obviously too general though ... i would toss out ParentChildrenTopGroupCollector to make it clear that:
>
>   a) what you can get out of it are instances of TopGroups
>   b) the Groups consist of Parents and Children
>
> ...but that may be trying to convey too much in a classname.

I agree we want Top in the name, since it's collecting Top hits according to the provided Sort...

I don't think we should put Groups in the name just because this class (TopGroups) is used to represent the returned hits. Really in this context they aren't groups in the grouping module sense; they are the nested docs (parent + children), just using TopGroups to represent that for now.

In fact, once we generalize TopDocs so that the type of each hit can be parameterized, this collector would return TopDocs<NestedDoc> and each NestedDoc would have the parent docID, maybe sort field values, and then the TopDocs<ScoreDoc> holding the child hits. (But I'm scared of the generics required here!)

So I guess I would keep Top but drop Groups, replace ParentChildren with NestedDocs, and move the Top in front: TopNestedDocsCollector.

Mike
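The behaviour confirmed in the exchange above (a query over child docs whose results are the parents of the matching children) can be sketched with a toy model. This is not Lucene code: `BlockJoinSketch` and `parentsOfMatches` are hypothetical names, and the layout in which each parent is stored as the last doc of its block is an assumption made for illustration. The thread itself only says that a filter identifies the parent docs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of an index-time block join; not Lucene code. */
class BlockJoinSketch {

    /**
     * Maps matching child docIDs (in increasing order) to parent docIDs.
     * Assumption: every block is contiguous and ends with its parent, so the
     * parent of a child is the next set bit in the parent bitset.
     */
    static List<Integer> parentsOfMatches(BitSet parents, int[] matchingChildren) {
        List<Integer> result = new ArrayList<>();
        for (int child : matchingChildren) {
            int parent = parents.nextSetBit(child);
            if (parent < 0) {
                continue; // no parent follows this doc; skip it
            }
            // Children arrive in docID order, so repeated parents are adjacent.
            if (result.isEmpty() || result.get(result.size() - 1) != parent) {
                result.add(parent);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // docIDs 0..2 are pages of book A (parent docID 3);
        // docIDs 4..5 are pages of book B (parent docID 6).
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(6);
        int[] matchingPages = {0, 2, 5}; // hits of the "inner" query
        System.out.println(parentsOfMatches(parents, matchingPages)); // prints [3, 6]
    }
}
```

The bitset plays the role of the `parents` filter in the quoted snippet, and the returned list corresponds to what an ordinary collector would see when it only cares about the parent document space.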
Re: revisit naming for grouping/join?
>> either way ... subdocuments has the advantage of being a shorter directory name.
>
> Yeah both are rather long... Maybe modules/nested? modules/nesteddocs?

I like modules/nested
Re: revisit naming for grouping/join?
> : Maybe modules/nested? modules/nesteddocs?
>
> modules/subdocs
> modules/nesteddocs
> modules/nested
>
> None of them scream "this is the perfect name" to me, but none of them scream "dear lord this is a terrible idea" either. Instinct says "All other factors being equal, pick the shortest name".
>
> : Hmm... sub feels like it undersells, ie emphasizes "under" or
> : "inferior to" and de-emphasizes the strong cooperation w/ the parent.

How about modules/superdoc? It wouldn't undersell, at least :)

-Sokolov

and SuperDocQueryWrapper, etc...
Re: revisit naming for grouping/join?
: How about modules/superdoc?
:
: It wouldn't undersell, at least :)

you make a frighteningly compelling argument

modules/superdoc - Classes for dealing with data which can be modeled as a nested hierarchy of documents which can be contained in super documents (which may themselves be contained in larger super documents)

SuperDocQueryWrapper - given a query which matches documents, wraps that query and returns the super documents

TopSuperDocCollector - when used to collect matches of a query, collects a SuperDoc instance for each matching document, containing the matching nested documents if that doc has any.

-Hoss
Re: revisit naming for grouping/join?
: In my example the city was parent -- I raised this example to explain
: that index-time joining is more general than just nested docs (ie, I
: think we should keep the name "join" for this module... also because
: we should factor out more general search-time-only join capabilities
: into it).

i think that may be the wrong approach to take when discussing examples. while it's great to say "there are dozens of usecases that these features can all support in dozens of diff ways", we should really focus on naming/demoing these use cases in the ways where they really make the most sense.

In other words, i don't think we should say "All of these types of problems are different types of nails, and all of these modules are specialty hammers that are slightly distinct from each other in how they work, but you can use any of these hammers on any of these nails"; instead we should say "here are some specialty hammers, you can use them for lots of types of nails, but for each hammer here is the type of nail where it really shines".

block-index-join, as i understand it, requires all the docs you want to join up to be in one contiguous range of docids in the index, so if you want to re-index one doc in a block you have to re-index the entire block -- so the city/doctor example doesn't sound like a good generic example of when/why to use this (because a doctor might change his office hours, or address -- maybe even leaving the city completely, while a city might change its population w/o the doctor being affected at all).

The book and pages example seems much more appropriate, since in the real world these things change in lock step -- pages aren't added/removed to a book; pages don't change w/o the book itself being fundamentally changed. the fields of a page document are the text of that page, and that is inherently data about the book -- the fields of a doctor document are metadata about the doctor, and that is not inherently data about the city the doctor lives in.

as for the name ... i understand why it's called module/join and i understand why the classes are called BlockJoinQuery and BlockJoinCollector but i don't think those names really stand out and convey to end users what they do and how/why they are useful.

Personally i think better names would be modules/subdocuments, ParentDocumentQuery and ChildDocumentsCollector.

I know McCandless isn't a fan of the name "Nested Documents" because this functionality *can* be used for use cases where the data being modeled is not strictly organized in a nested relationship, but that doesn't mean it's *optimal* or easy for a user to apply to other usecases, because they have to design their model (and their indexing strategy) in such a way that they think of them as nested or hierarchical documents.

Naming it module/subdocuments would not only emphasize the usecase where it really shines, it would more importantly draw attention to how users have to model their data in order to take advantage of it -- and using ParentDocument and ChildDocuments in the names of the Query/Collector would make it clear what they match on relative to the underlying query that they wrap/collect.

it would also help distinguish from more general joins like what solr does today -- it seems like that should eventually take the name module/join.

At a minimum we should rename what we have now to modules/block-join or modules/index-join (but the latter is confusing) and eventually add modules/query-join (yes, yes, block joins provide a query, but the difference is when you have to make a decision about how you want to join your model: at index time or at query time).

-Hoss
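The re-indexing cost described above (changing one doc in a block means rewriting the whole block) can be illustrated with a small self-contained sketch. It is a toy list-based model, not Lucene's index; `BlockReindexSketch`, `addBlock`, `replaceBlock` and the `parent:` marker are all hypothetical names chosen for illustration, and the parent-last layout is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of block re-indexing; not Lucene code. */
class BlockReindexSketch {
    // One entry per document. A block is a contiguous run of children
    // followed by its parent, which is marked with a "parent:" prefix.
    final List<String> docs = new ArrayList<>();

    /** Appends the children and then the parent as one contiguous block. */
    void addBlock(String parentId, List<String> children) {
        docs.addAll(children);
        docs.add("parent:" + parentId);
    }

    /** Drops the whole existing block for parentId, then appends a new one. */
    void replaceBlock(String parentId, List<String> newChildren) {
        int end = docs.indexOf("parent:" + parentId);
        if (end >= 0) {
            int start = end;
            // Walk back to the first child of this block.
            while (start > 0 && !docs.get(start - 1).startsWith("parent:")) {
                start--;
            }
            docs.subList(start, end + 1).clear();
        }
        addBlock(parentId, newChildren);
    }

    public static void main(String[] args) {
        BlockReindexSketch index = new BlockReindexSketch();
        index.addBlock("bookA", Arrays.asList("A page 1", "A page 2"));
        index.addBlock("bookB", Arrays.asList("B page 1"));
        // Revising a single page of book A rewrites all of book A's block.
        index.replaceBlock("bookA", Arrays.asList("A page 1", "A page 2 (revised)"));
        System.out.println(index.docs);
    }
}
```

In this model book B is untouched while every doc of book A is removed and re-added, which is why data that changes in lock step (book and pages) fits the block layout better than data that changes independently (city and doctors).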
Re: revisit naming for grouping/join?
OK I'm sold! I agree: let's rename this new module according to the most likely use case, not according to its logical function, and I agree nested documents is the compelling use case here. Then fully generic joins can go to a new module/join.

Maybe modules/nesteddocuments (I think that's more descriptive than subdocuments)?

How about NestedDocumentQuery? And NestedDocumentCollector?

See, you can use NestedDocumentQuery but collect it with any ordinary collector if you don't care about the nesting (ie, you are only interested in matches in the parent document space). The NestedDocumentCollector also collects all the nested docs matching each parent hit.

You can of course still use this Query/Collector for any kind of join, as long as your app is able to do this join at indexing time and index all joined docs to a single row of the primary table as a doc block. But this will presumably be a less common use case so I agree we should just name this feature according to its common use case.

Mike McCandless
http://blog.mikemccandless.com

On Mon, Jul 4, 2011 at 1:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
Re: revisit naming for grouping/join?
: Maybe modules/nesteddocuments (I think that's more descriptive than
: subdocuments)?

either way ... subdocuments has the advantage of being a shorter directory name.

i kinda wonder about first impressions and the etymology of "nested" ... it makes me think of bird nests and Russian dolls, neither of which really convey the point: nesting in birds is about protecting/incubating and is only a single layer; while Russian nesting dolls are singular wrappers around wrappers around wrappers. subdocuments seems like it might be better because it conveys more of a hierarchical nature (to me anyway).

: How about NestedDocumentQuery? And NestedDocumentCollector?
:
: See, you can use NestedDocumentQuery but collect it with any ordinary
: collector if you don't care about the nesting (ie, you are only
: interested in matches in the parent document space). The
: NestedDocumentCollector also collects all the nested docs matching
: each parent hit.

Hmmm... My suggestion of ParentDocumentQuery was based on the understanding that the simplest usecase was...

  Query inner = getSomethingThatMatchesSomeChildDocs();
  Filter parents = someFilterThatMatcheAllKnownParentDocs();
  Query outer = new ParentDocumentQuery(inner, parents);
  TopDocs results = searcher.search(outer);

...and in this case results will contain the parents of the child documents that match inner. is that correct? if so, then independent of the Collector, ParentDocumentQuery (or ParentDocumentQueryWrapper) still seems like it makes the most sense.

For the Collector, i realize now that i totally misunderstood its API -- for some reason i thought it would wrap another Collector and proxy to the inner collector only the parents, independently collecting/recording the groups of parent-children info which could be asked for later. ChildDocumentsCollector definitely doesn't make sense -- it's not just collecting children, it's collecting Groups made up of parents and children ... GroupCollector is obviously too general though ... i would toss out ParentChildrenTopGroupCollector to make it clear that:

  a) what you can get out of it are instances of TopGroups
  b) the Groups consist of Parents and Children

...but that may be trying to convey too much in a classname. I certainly wouldn't complain about NestedDocumentCollector or SubDocumentCollector if people like those better.

-Hoss
Re: revisit naming for grouping/join?
On Fri, Jul 1, 2011 at 10:02 AM, Robert Muir rcm...@gmail.com wrote:

> On Fri, Jul 1, 2011 at 8:51 AM, Michael McCandless luc...@mikemccandless.com wrote:
>
>> The join module does currently depend on the grouping module, but for a silly reason: just for the TopGroups, to represent the returned hits. We could move TopGroups/GroupDocs into common (thus justifying its generic name!)? Then both join and grouping modules depend on common.
>
> Just a suggestion: maybe they belong in the lucene core? And maybe the stuff in the common module belongs in lucene core's util package?

+1

If we can generalize TopDocs so that we parameterize the type of each hit, ie it could be a leaf (single doc + score (ScoreDoc) + maybe field values (FieldDoc)) or another TopDocs, then we don't need the separate TopGroups anymore.

> I guess I'm suggesting we try to keep our modules as flat as possible, with as few dependencies as possible. I think we really already have a 'common' module, that's the lucene core. If multiple modules end up relying upon the same functionality, especially if it's something simple like an abstract class (Analyzer) or a utility thing (these mutable integers, etc), then that's a good sign it belongs in core apis.

I like this approach.

> I think we really need to try to nuke all these dependencies between modules: it's great to add them as a way to get refactoring started, but ultimately we should try to clean up, because we don't want a complex 'graph' of dependencies but instead something dead-simple. I made a total mess with the analyzers module at first, i think everything depended on it! but now we have nuked almost all dependencies on this thing, except for where it makes sense to have that concrete dependency (benchmark, demo, solr).

Good!

>> I think what would be best is a smallish but feature complete demo, ie pull together some easy-to-understand sample content and then build a small demo app around it. We could then show how to use grouping for field collapsing (and for other use cases), joining for nested docs (and for other use cases), etc.
>
> For the same reason listed above, I think we should take our contrib/demo and consolidate 'examples' across various places into this demo module. The reason is:
>
> * examples typically depend upon 'concrete' stuff, but in general core stuff should work around interfaces/abstract classes: e.g. the faceting module has an analyzers dependency only because of its examples.
> * examples might want to integrate modules, e.g. an example of how to integrate faceting and grouping or something like that.
> * examples are important: i think if the same question comes up on the user list often, we should consider adding an example.

+1

I think especially now that we have very new interesting modules (facet, join, grouping), we really need corresponding examples to showcase all of this.

Mike
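The idea of parameterizing the type of each hit, so that a hit is either a leaf or another result list, can be sketched in plain Java. This is only an illustration of the shape being discussed, not Lucene's `TopDocs` or `TopGroups`; the class names `Hits`, `LeafHit` and `NestedHit` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of result lists with a parameterized hit type; not Lucene code. */
class RecursiveHitsSketch {

    /** A leaf hit: a single doc plus its score. */
    static final class LeafHit {
        final int docId;
        final float score;
        LeafHit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** A result list whose hit type H is a parameter. */
    static final class Hits<H> {
        final int totalHits;
        final List<H> hits;
        Hits(int totalHits, List<H> hits) {
            this.totalHits = totalHits;
            this.hits = hits;
        }
    }

    /** A hit that carries a parent docID and its own list of child hits. */
    static final class NestedHit {
        final int parentDocId;
        final Hits<LeafHit> children;
        NestedHit(int parentDocId, Hits<LeafHit> children) {
            this.parentDocId = parentDocId;
            this.children = children;
        }
    }

    /** Sums the child hits held under every parent hit. */
    static int countChildHits(Hits<NestedHit> top) {
        int count = 0;
        for (NestedHit hit : top.hits) {
            count += hit.children.hits.size();
        }
        return count;
    }

    /** Builds two parent hits holding two and one child hits respectively. */
    static Hits<NestedHit> example() {
        Hits<LeafHit> pagesOfBookA =
            new Hits<>(2, Arrays.asList(new LeafHit(0, 1.5f), new LeafHit(2, 0.7f)));
        Hits<LeafHit> pagesOfBookB =
            new Hits<>(1, Arrays.asList(new LeafHit(5, 0.9f)));
        return new Hits<>(2, Arrays.asList(
            new NestedHit(3, pagesOfBookA), new NestedHit(6, pagesOfBookB)));
    }

    public static void main(String[] args) {
        System.out.println(countChildHits(example())); // prints 3
    }
}
```

With a single generic container, the same type serves flat results (`Hits<LeafHit>`) and nested ones (`Hits<NestedHit>`), which is the sense in which a separate groups class would no longer be needed. Deeper nesting would need `NestedHit` itself to be generic, which is where the generics get harder.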
Re: revisit naming for grouping/join?
On Fri, Jul 1, 2011 at 9:28 AM, mark harwood markharw...@yahoo.co.uk wrote:

>> I think what would be best is a smallish but feature complete demo,
>
> For the nested stuff I had a reasonable demo on LUCENE-2454 that was based around resumes - that use case has the one-to-many characteristics that lend themselves to nesting, e.g. a person has many different qualifications and records of employment. This scenario was illustrated here: http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
>
> I also had the book search type scenario where a book has many sections and, for the purposes of efficient highlighting/summarisation, these sections were treated as child docs which could be read quickly (rather than highlighting a whole book).

I think both resumes and book search, and also others like the variants of a product SKU, would all make good examples for the nested docs use case.

> I'm not sure what the parent was in your doctor and cities example, Mike. If a doctor is in only one city then there is no point making city a child doc, as the one city info can happily be combined with the doctor info into a single document with no conflict (doctors have different properties to cities). If the city is the parent with many child doctor docs that makes more sense but feels like a less likely use case, e.g. "find me a city with doctor x and a different doctor y". Searching for a person with excellent java and preferably good lucene skills feels like a more real-world example.

In my example the city was parent -- I raised this example to explain that index-time joining is more general than just nested docs (ie, I think we should keep the name "join" for this module... also because we should factor out more general search-time-only join capabilities into it).

> It feels like documenting some of the trade-offs behind index design choices is useful too, e.g. nesting is not too great for very volatile content with constantly changing children, while search-time join is more costly in RAM and 2-pass processing.

+1, especially once we've factored out generic joins.

Mike
Re: revisit naming for grouping/join?
I think joining and grouping are two different functions, and we should keep different modules for them... On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir rcm...@gmail.com wrote: Hi, when looking at just a very quick glance at some of the newer grouping/join features, I found myself a little confused about what is exactly what, and I think users might too. They are confusing! I discussed some of this with hossman, and it only seemed to make me even more totally confused about: * difference between field collapsing and grouping I like the name grouping better here: I think field collapsing undersells (it's only one specific way to use grouping). EG, grouping w/o collapsing is useful (eg, Best Buy grouping hits by product category and showing the top 5 in each). * difference between nested documents and the index-time join Similarly I think nested docs undersells index-time join: you can join (either during indexing or during searching) in many different ways, and nested docs is just one use case. EG, maybe your docs are doctors but during indexing you join to a city table with facts about that city (each doctor's office is in a specific city) and then you want to run queries like city's avg annual temp 60 and doctor has good bedside manner or something. * difference between index-time-join/nested documents and single-pass index-time grouping. Is the former only a more general case of the latter? Grouping is purely a presentation concern -- you are not altering which docs hit; you are simply changing how you pick which hits to display (top N by group). So we only have collectors here. The generic (requires 2 passes) collectors can group on anything at search time; the doc block collector requires that you indexed all docs in each group as a block. Join is both about restricting matches and also presentation of hits, because your query needs to match fields from different [logical] tables (so, the module has a Query and a Collector). 
When you get the results back, you may or may not be interested in retaining the table structure in your result set (ie, you may not have selected fields from the child table).

Similarly, generic joining (in Solr/ElasticSearch today but I'd like to factor into the join module) can do any join at search time, while the doc block collector requires that you did the necessary join(s) during indexing.

> * difference between the above joinish capabilities and solr's join impl... other than the single-pass/index-time limitation (which is really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join anything at search time (even across 2 different indexes), vs doc block join where you must pick which joins you will ever want to use and then build the index accordingly.

You can also mix the two. Maybe you do certain joins while indexing, but then at search time you do other joins generically. That's fine. (Same is true for grouping).

> I think it's especially interesting since the join module depends on the grouping module.

The join module does currently depend on the grouping module, but for a silly reason: just for the TopGroups, to represent the returned hits. We could move TopGroups/GroupDocs into common (thus justifying its generic name!)? Then both join and grouping modules depend on common.

Really TopGroups is just a TopDocs that allows some recursion (ie, each hit may in turn be another TopDocs). But TopGroups is limited now to only depth 2 recursion... we need to fix this for nested grouping. Really we just need a recursive TopDocs here.

> So I am curious if we should:
> * add docs (maybe with simple examples) in the package.html or otherwise that differentiate what these guys are, or at least agree on some consistent terminology and define it somewhere? I feel like people have explained to me the differences in all these things before, but then it's easy to forget.
Well, each module's package.html has a start here, but I agree we should do more.

I think what would be best is a smallish but feature complete demo, ie pull together some easy-to-understand sample content and then build a small demo app around it. We could then show how to use grouping for field collapsing (and for other use cases), joining for nested docs (and for other use cases), etc.

Mike McCandless
http://blog.mikemccandless.com
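To make the grouping description above a bit more concrete, here is a minimal Python sketch (not Lucene code) of generic two-pass grouping over toy hits. The names Hit, GroupedHits, first_pass_top_groups and second_pass_top_docs are hypothetical and exist only for this illustration, and the tie-breaking rules are assumptions. Pass 1 picks the top groups by their best score; pass 2 keeps the top docs inside each of those groups. The result is a list of groups that each hold their own ranked list, i.e. the depth-2 shape the message attributes to TopGroups.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical toy hit: (doc_id, group_key, score). Not a Lucene type.
Hit = Tuple[int, str, float]


@dataclass
class GroupedHits:
    """Ranked groups, each with its own ranked hits (depth-2 nesting)."""
    groups: List[Tuple[str, List[Hit]]] = field(default_factory=list)


def first_pass_top_groups(hits: List[Hit], top_n_groups: int) -> List[str]:
    """Pass 1: rank groups by the best score of any hit they contain."""
    best: Dict[str, float] = {}
    for _doc_id, key, score in hits:
        if key not in best or score > best[key]:
            best[key] = score
    # Assumption: ties between groups are broken by group key.
    ranked = sorted(best, key=lambda k: (-best[k], k))
    return ranked[:top_n_groups]


def second_pass_top_docs(hits: List[Hit], group_keys: List[str],
                         docs_per_group: int) -> GroupedHits:
    """Pass 2: for the chosen groups only, keep the best docs per group."""
    wanted = set(group_keys)
    per_group: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        if hit[1] in wanted:
            per_group[hit[1]].append(hit)
    result = GroupedHits()
    for key in group_keys:
        # Assumption: ties between docs are broken by doc id.
        ranked = sorted(per_group[key], key=lambda h: (-h[2], h[0]))
        result.groups.append((key, ranked[:docs_per_group]))
    return result


def group_hits(hits: List[Hit], top_n_groups: int,
               docs_per_group: int) -> GroupedHits:
    keys = first_pass_top_groups(hits, top_n_groups)
    return second_pass_top_docs(hits, keys, docs_per_group)


hits = [
    (1, "tv", 0.9), (2, "tv", 0.8), (3, "tv", 0.1),
    (4, "laptop", 0.7), (5, "laptop", 0.95),
    (6, "phone", 0.3),
]
grouped = group_hits(hits, top_n_groups=2, docs_per_group=2)
```

Note that nothing here changes which docs match: every doc in `grouped` was already in `hits`, and grouping only decides which of them get displayed. A join, by contrast, would also have to filter `hits` based on fields from a second logical table.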
Re: revisit naming for grouping/join?
> I think what would be best is a smallish but feature complete demo,

For the nested stuff I had a reasonable demo on LUCENE-2454 that was based around resumes - that use case has the one-to-many characteristics that lend themselves to nesting, e.g. a person has many different qualifications and records of employment. This scenario was illustrated here:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

I also had the book search type scenario, where a book has many sections and, for the purposes of efficient highlighting/summarisation, these sections were treated as child docs which could be read quickly (rather than highlighting a whole book).

I'm not sure what the parent was in your doctor and cities example, Mike. If a doctor is in only one city then there is no point making city a child doc, as the one city's info can happily be combined with the doctor info into a single document with no conflict (doctors have different properties to cities). If the city is the parent with many child doctor docs, that makes more sense but feels like a less likely use case, e.g. find me a city with doctor x and a different doctor y. Searching for a person with excellent java and preferably good lucene skills feels like a more real-world example.

It feels like documenting some of the trade-offs behind index design choices is useful too, e.g. nesting is not too great for very volatile content with constantly changing children, while search-time join is more costly in RAM and in 2-pass processing.

Cheers
Mark

- Original Message
From: Michael McCandless luc...@mikemccandless.com
To: dev@lucene.apache.org
Sent: Fri, 1 July, 2011 13:51:04
Subject: Re: revisit naming for grouping/join?

I think joining and grouping are two different functions, and we should keep different modules for them...
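As a rough illustration of why the resume scenario wants nested child docs rather than one flattened document, here is a small Python sketch. It is not based on the LUCENE-2454 code; the data, the field names and the functions flatten, flat_match and nested_match are all made up for this example. It shows that once child records are merged into multi-valued fields, "excellent java" can no longer be told apart from "some java plus excellent something else".

```python
from typing import Dict, List, Set

# Hypothetical resumes: one parent (person) with several child records.
people = [
    {"name": "alice", "skills": [{"skill": "java", "level": "excellent"},
                                 {"skill": "lucene", "level": "good"}]},
    {"name": "bob", "skills": [{"skill": "java", "level": "basic"},
                               {"skill": "lucene", "level": "excellent"}]},
]


def flatten(person: Dict) -> Dict[str, Set[str]]:
    """Single-document model: child fields merged into multi-valued fields."""
    return {
        "skill": {s["skill"] for s in person["skills"]},
        "level": {s["level"] for s in person["skills"]},
    }


def flat_match(doc: Dict[str, Set[str]], skill: str, level: str) -> bool:
    # The flattened doc has lost which level belongs to which skill.
    return skill in doc["skill"] and level in doc["level"]


def nested_match(person: Dict, skill: str, level: str) -> bool:
    # Both conditions must hold on the same child record.
    return any(s["skill"] == skill and s["level"] == level
               for s in person["skills"])


flat_hits: List[str] = [p["name"] for p in people
                        if flat_match(flatten(p), "java", "excellent")]
nested_hits: List[str] = [p["name"] for p in people
                          if nested_match(p, "java", "excellent")]
```

Here `flat_hits` contains both people, because bob has java and has an "excellent" level somewhere, while `nested_hits` contains only alice. The trade-off mentioned above still applies: keeping children next to their parent makes this check cheap, but any change to one child means rewriting the whole parent/child block.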
Re: revisit naming for grouping/join?
On Fri, Jul 1, 2011 at 8:51 AM, Michael McCandless luc...@mikemccandless.com wrote:

> The join module does currently depend on the grouping module, but for a silly reason: just for the TopGroups, to represent the returned hits. We could move TopGroups/GroupDocs into common (thus justifying its generic name!)? Then both join and grouping modules depend on common.

Just a suggestion: maybe they belong in the lucene core? And maybe the stuff in the common module belongs in lucene core's util package?

I guess I'm suggesting we try to keep our modules as flat as possible, with as few dependencies as possible. I think we really already have a 'common' module: that's the lucene core. If multiple modules end up relying upon the same functionality, especially if it's something simple like an abstract class (Analyzer) or a utility thing (these mutable integers, etc), then that's a good sign it belongs in core apis.

I think we really need to try to nuke all these dependencies between modules: it's great to add them as a way to get refactoring started, but ultimately we should try to clean up, because we don't want a complex 'graph' of dependencies but instead something dead-simple.

I made a total mess with the analyzers module at first, I think everything depended on it! But now we have nuked almost all dependencies on this thing, except for where it makes sense to have that concrete dependency (benchmark, demo, solr).

> I think what would be best is a smallish but feature complete demo, ie pull together some easy-to-understand sample content and then build a small demo app around it. We could then show how to use grouping for field collapsing (and for other use cases), joining for nested docs (and for other use cases), etc.

For the same reason listed above, I think we should take our contrib/demo and consolidate 'examples' across various places into this demo module. The reason is:

* examples typically depend upon 'concrete' stuff, but in general core stuff should work around interfaces/abstract classes: e.g. the faceting module has an analyzers dependency only because of its examples.
* examples might want to integrate modules, e.g. an example of how to integrate faceting and grouping or something like that.
* examples are important: I think if the same question comes up on the user list often, we should consider adding an example.
RE: revisit naming for grouping/join?
On 7/1/2011 at 10:02 AM, Robert Muir wrote:

> [...] I think we should take our contrib/demo and consolidate 'examples' across various places into this demo module. The reason is:
> * examples typically depend upon 'concrete' stuff, but in general core stuff should work around interfaces/abstract classes: e.g. the faceting module has an analyzers dependency only because of its examples.
> * examples might want to integrate modules, e.g. an example of how to integrate faceting and grouping or something like that.
> * examples are important: I think if the same question comes up on the user list often, we should consider adding an example.

+1
revisit naming for grouping/join?
Hi, when looking at just a very quick glance at some of the newer grouping/join features, I found myself a little confused about what is exactly what, and I think users might too.

I discussed some of this with hossman, and it only seemed to make me even more totally confused about:

* difference between field collapsing and grouping
* difference between nested documents and the index-time join
* difference between index-time-join/nested documents and single-pass index-time grouping. Is the former only a more general case of the latter?
* difference between the above joinish capabilities and solr's join impl... other than the single-pass/index-time limitation (which is really an implementation detail), I'm talking about use cases.

I think it's especially interesting since the join module depends on the grouping module.

So I am curious if we should:

* add docs (maybe with simple examples) in the package.html or otherwise that differentiate what these guys are, or at least agree on some consistent terminology and define it somewhere? I feel like people have explained to me the differences in all these things before, but then it's easy to forget.
* should we rename the join module to nested? or combine it with grouping as a subdocument module? or something else?
Re: revisit naming for grouping/join?
Hi,

On Fri, Jul 1, 2011 at 2:30 PM, Robert Muir rcm...@gmail.com wrote:

> Hi, when looking at just a very quick glance at some of the newer grouping/join features, I found myself a little confused about what is exactly what, and I think users might too.
>
> I discussed some of this with hossman, and it only seemed to make me even more totally confused about:
> * difference between field collapsing and grouping
> * difference between nested documents and the index-time join
> * difference between index-time-join/nested documents and single-pass index-time grouping. Is the former only a more general case of the latter?
> * difference between the above joinish capabilities and solr's join impl... other than the single-pass/index-time limitation (which is really an implementation detail), I'm talking about use cases.
>
> I think it's especially interesting since the join module depends on the grouping module.
>
> So I am curious if we should:
> * add docs (maybe with simple examples) in the package.html or otherwise that differentiate what these guys are, or at least agree on some consistent terminology and define it somewhere? I feel like people have explained to me the differences in all these things before, but then it's easy to forget.
> * should we rename the join module to nested? or combine it with grouping as a subdocument module? or something else?

What about, dare I say, a document-relation module? Joining documents across queries, in docblocks, into a group, or as part of a nest, all seem to be about relationships between documents.

With all of these concepts in the single module, we can then definitely work on the package.htmls to make it clear the purpose of each part (and maybe list other known phrases for the same concepts, such as field collapsing and grouping).

--
Chris Male | Software Developer | JTeam BV. | www.jteam.nl