Re: revisit naming for grouping/join?

2011-07-06 Thread Michael McCandless
On Tue, Jul 5, 2011 at 5:44 PM, Mike Sokolov soko...@ifactory.com wrote:

 : Maybe modules/nested? modules/nesteddocs?

        modules/subdocs
        modules/nesteddocs
        modules/nested

 None of them scream "this is the perfect name" to me, but none of them
 scream "dear lord this is a terrible idea" either.

 Instinct says "All other factors being equal, pick the shortest name"

 : Hmm... "sub" feels like it undersells, ie emphasizes "under" or
 : "inferior to" and de-emphasizes the strong cooperation w/ the parent.


 How about modules/superdoc?

 It wouldn't undersell, at least :)

I agree it's no longer underselling :)

But I like this even less than "sub"!  First, I think it has the same
problem that "sub" has, since it's just the mirror image: it's too unequal,
ie it implies one side is superior to and above the other side, when in
fact the two sides of a join (XML search, product SKUs, nested docs, etc.)
are really symmetric.  The nested parts of the doc are just as valid a part
of the document as the non-nested part.

Second, I don't like the super-ness of "super" (ie, in the sense of
supercalifragilisticexpialidocious or superman or superwoman) -- it's
too generic, ie, like "best" or "awesome".

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: revisit naming for grouping/join?

2011-07-06 Thread Mike Sokolov



On 07/06/2011 08:47 AM, Michael McCandless wrote:

 How about modules/superdoc?

 It wouldn't undersell, at least :)

 I agree it's no longer underselling :)

 But I like this even less than sub!  First, I think it has the same
 problems that sub has since it's just symmetric: it's too un-equal, ie
 implies one side is superior and above the other side,

I basically agree, although I think there is an asymmetry in that this
is a many-to-one relation?  The main improvement this name makes is the
removal of the plural in the other options (doc vs docs).  And it's
shorter than huperduperdoc :)  But otoh nothing I've seen here really
captures all that much about index-time vs query-time join, which seems
to be the main distinction (why you can't just call it "join")?  If
you're still in the market for names, here are a few: StructureJoin,
IntrinsicJoin, TreeJoin; Branch? Just brainstorming loosely.  Frankly
Nest* seems good enough.


-Sokolov




Re: revisit naming for grouping/join?

2011-07-06 Thread Chris Hostetter

: Also... I think we are over-thinking the name ;)  We can't convey
: *everything* in this name; as long as the name makes it clear that
: you'll want to consider this / read its javadocs whenever doing
: something with nested docs, I think that's sufficient.  I think
: NestedQueryWrapper (maybe NestedDocsQuery) and NestedDocsCollector are
: good enough, at least better than the functional-driven names they now
: have...

Yeah, that's fair ... i'm not in love with NestedDocsQuery and
NestedDocsCollector but i agree they are better than what we have now.

: Honestly at this point I'm tempted to just stick with what we have
: (the functionally driven names, instead of the dominant use case
: driven name).
: 
: At its heart, this query is performing a join (well, finishing the
: join that was done during indexing), and despite our efforts to more
: descriptively capture the dominant use case, I don't think we're
: succeeding.  We are basically struggling to find ways to explain what
: a join does, into these class names.

I really think it's a bad idea to use "Join" in the name ... i understand
that to you this is a join, but as you say it's really just finishing a
join that was already done at index time -- for most users "join" is
going to have the connotation of a SQL join where you don't have to
normalize the data in advance (ie: build the index with all the docs you
want to join in a block) and we shouldn't use it unless we are talking
about a truly generic query time join -- particularly if we are going to
use examples in the doc that seem like the kind of thing you would do with
a query time join in SQL.

i know you feel like "nested" (or "subdocs" or "parent") undersells the
*possible* use cases of this feature, but the thing to remember is that
even in the use cases where the real life data isn't something you might
think of as being organized in a "nested" or "hierarchical" model, in
order to use this feature the user must map their source data model to a
Lucene Document model that *does* capture a hierarchy relationship so they
can index their data in the appropriate way.  X and Y may not be in a
hierarchy, but if you want to join them like this, then the Document for X
and the Document for Y must be thought of as being in a hierarchy and
indexed in lock step with each other.

"Block" just doesn't feel like it really conveys this ... but anything
along the "Nested", "Parent", "Subdoc" line of terminology would at least
give some point of reference to the idea that the *Document* model in
Lucene needs to be organized in this way -- and i think it's really
important that the name make that clear.

-Hoss




RE: revisit naming for grouping/join?

2011-07-06 Thread Steven A Rowe
From my external POV on this debate, it seems as though the main point of 
contention is naming the nature of the relation between documents.  

Instead of doing that, a name that says that there is some form of relation, 
but leaving open its nature, might work: something like docrelation?  
(Avoiding the related documents IR concept would be important here.)

Steve




Re: revisit naming for grouping/join?

2011-07-05 Thread Michael McCandless
On Mon, Jul 4, 2011 at 3:38 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Maybe modules/nesteddocuments (I think that's more descriptive than
 : subdocuments)?

 either way ... subdocuments has the advantage of being a shorter directory
 name.

Yeah both are rather long...

Maybe modules/nested? modules/nesteddocs?

 i kinda wonder about first impressions and the etymology of "nested" ...
 it makes me think of bird nests and Russian dolls, neither of which
 really convey the point: nesting in birds is about protecting/incubating
 and is only a single layer; while Russian nesting dolls are singular
 wrappers around wrappers around wrappers.

 subdocuments seems like it might be better because it conveys more of a
 hierarchical nature (to me anyway).

Hmm... "sub" feels like it undersells, ie emphasizes "under" or
"inferior to" and de-emphasizes the strong cooperation w/ the parent.

Also, the nesting is not just one level -- it can support an arbitrary
star join.  So you can join from main to table1 and then from table1
to table2 (parent, child, grandchild).  You can also join to multiple
child tables from the main table.

I think nested/nesting has strong enough meaning among programmers
that most will understand what it means in this context.

 : How about NestedDocumentQuery?  And NestedDocumentCollector?
 :
 : See, you can use NestedDocumentQuery but collect it with any ordinary
 : collector if you don't care about the nesting (ie, you are only
 : interested in matches in the parent document space).  The
 : NestedDocumentCollector also collects all the nested docs matching
 : each parent hit.

 Hmmm...

 My suggestion of ParentDocumentQuery was based on the understanding that
 the simplest usecase was...

  Query inner = getSomethingThatMatchesSomeChildDocs();
  Filter parents = someFilterThatMatchesAllKnownParentDocs();
  Query outer = new ParentDocumentQuery(inner, parents);
  TopDocs results = searcher.search(outer);

 ...and in this case results will contain the parents of the child
 documents that match inner.  is that correct?

Correct.

 if so, then independent of the Collector, ParentDocumentQuery (or
 ParentDocumentQueryWrapper) still seems like it makes the most sense.

Hmm, but that doesn't convey that it handles this nesting, ie, that
it's joining child docs with parent docs.

Also, these queries can be nested (from 2nd join in the star join),
and so it could be ChildAndGrandChildrenQuery.

I guess Wrapper would make sense since it wraps a query matching the
nested docs.  I think Document is redundant/implied?

Maybe NestedQueryWrapper?

 For the Collector, i realize now that i totally misunderstood its API --
 for some reason i thought it would wrap another Collector and proxy to the
 inner collector only the parents, independently collecting/recording the
 groups of parent-children info which could be asked for later.

 ChildDocumentsCollector definitely doesn't make sense -- it's not
 just collecting children, it's collecting Groups made up of parents
 and children ... GroupCollector is obviously too general though ... i
 would toss out ParentChildrenTopGroupCollector to make it clear that:
  a) what you can get out of it are instances of TopGroups
  b) the Groups consist of Parents and Children

 ...but that may be trying to convey too much in a classname.

I agree we want Top in the name, since it's collecting Top hits
according to provided Sort... I don't think we should put Groups in
the name just because this class (TopGroups) is used to represent the
returned hits.  Really in this context they aren't groups in the
grouping module sense; they are the nested docs (parent + children),
just using TopGroups to represent that for now.

In fact, once we generalize TopDocs so that the type of each hit can
be parameterized, then this collector would return TopDocs<NestedDoc>
and each NestedDoc would have parent docID, maybe sort field values,
and then the TopDocs<ScoreDoc> holding the child hits.  (But I'm
scared of the generics required here!).
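
A rough sketch of what a parameterized hit list could look like, in plain
Java.  Everything here is hypothetical: Hits, ScoredDoc and NestedDoc are
stand-in names invented for this illustration and are not Lucene classes.
The only point is that once the hit type is a type parameter, a
nested-docs collector can return parent hits that each carry their own
list of child hits.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are Lucene classes.
final class GenericHitsSketch {

    // A result list whose hit type is a type parameter.
    static final class Hits<T> {
        final int totalHits;  // number of matches overall, not just those kept
        final List<T> hits;   // the kept top hits, best first

        Hits(int totalHits, List<T> hits) {
            this.totalHits = totalHits;
            this.hits = hits;
        }
    }

    // A leaf hit: one document and its score.
    static final class ScoredDoc {
        final int docId;
        final float score;

        ScoredDoc(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // A parent hit that carries its own top child hits.
    static final class NestedDoc {
        final int parentDocId;
        final Hits<ScoredDoc> children;

        NestedDoc(int parentDocId, Hits<ScoredDoc> children) {
            this.parentDocId = parentDocId;
            this.children = children;
        }
    }

    // Builds one parent hit (doc 2) holding two child hits (docs 0 and 1).
    static Hits<NestedDoc> example() {
        Hits<ScoredDoc> children = new Hits<>(2,
                Arrays.asList(new ScoredDoc(0, 1.5f), new ScoredDoc(1, 0.7f)));
        return new Hits<>(1, Collections.singletonList(new NestedDoc(2, children)));
    }

    public static void main(String[] args) {
        for (NestedDoc parent : example().hits) {
            System.out.println("parent " + parent.parentDocId + " has "
                    + parent.children.hits.size() + " child hits");
        }
    }
}
```

With this shape an ordinary flat result is just Hits<ScoredDoc>, and the
nested result is Hits<NestedDoc>, so no separate group type is needed to
represent parent + children.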

So I guess I would keep Top but drop Groups, and replace
ParentChildren with NestedDocs and move the Top in front:
TopNestedDocsCollector.

Mike




Re: revisit naming for grouping/join?

2011-07-05 Thread Ryan McKinley

 either way ... subdocuments has the advantage of being a shorter directory
 name.

 Yeah both are rather long...

 Maybe modules/nested? modules/nesteddocs?


I like modules/nested




Re: revisit naming for grouping/join?

2011-07-05 Thread Mike Sokolov



: Maybe modules/nested? modules/nesteddocs?

modules/subdocs
modules/nesteddocs
modules/nested

None of them scream "this is the perfect name" to me, but none of them
scream "dear lord this is a terrible idea" either.

Instinct says "All other factors being equal, pick the shortest name"

: Hmm... "sub" feels like it undersells, ie emphasizes "under" or
: "inferior to" and de-emphasizes the strong cooperation w/ the parent.
   

How about modules/superdoc?

It wouldn't undersell, at least :)

-Sokolov

and SuperDocQueryWrapper, etc...




Re: revisit naming for grouping/join?

2011-07-05 Thread Chris Hostetter

: How about modules/superdoc?
: 
: It wouldn't undersell, at least :)

you make a frighteningly compelling argument

modules/superdoc - Classes for dealing with Data which can be 
modeled as a nested hierarchy of documents which can be contained in 
super documents (which may themselves be contained in larger super 
documents)

SuperDocQueryWrapper - given a query which matches documents, wraps that
query and returns the super documents

TopSuperDocCollector - when used to collect matches of a query,
collects a SuperDoc instance for each matching document, containing the
matching nested documents if that doc has any.



-Hoss




Re: revisit naming for grouping/join?

2011-07-04 Thread Chris Hostetter

: In my example the city was parent -- I raised this example to explain
: that index-time joining is more general than just nested docs (ie, I
: think we should keep the name join for this module... also because
: we should factor out more general search-time-only join capabilities
: into it).

i think that may be the wrong approach to take when discussing examples,
while it's great to say there are dozens of use cases that these features
can all support in dozens of diff ways we should really focus on
naming/demoing these use cases in the ways where they really make the most
sense.

In other words, i don't think we should say "All of these types of problems
are different types of nails, and all of these modules are specialty
hammers that are slightly distinct from each other in how they work, but
you can use any of these hammers on any of these nails" -- instead we should
say "here are some specialty hammers, you can use them for lots of
types of nails, but for each hammer here is the type of nail where it
really shines"


block-index-join as i understand it requires all the docs you want to
join up to be in one contiguous range of docids in the index, so if you want to
re-index one doc in a block you have to re-index the entire block -- so
the city/doctor example doesn't sound like a good generic example of
when/why to use this (because a doctor might change his office
hours, or address -- maybe even leaving the city completely, while a
city might change its population w/o the doctor being affected at all).

The book and pages example seems much more appropriate, since in the
real world these things change in lock step -- pages aren't added/removed to
a book; pages don't change w/o the book itself being fundamentally
changed.  the fields of a "page" document are the text of that page, and
that is inherently data about the book -- the fields of a "doctor"
document are metadata about the doctor, and that is not inherently data
about the city the doctor lives in.
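
To make the block constraint concrete, here is a toy sketch in plain Java
(no Lucene classes).  It assumes one particular layout, in which each
block stores its child docs (pages) first and the parent doc (book) last,
with parents marked in a BitSet.  That layout, the class name
BlockJoinSketch and its methods are assumptions made for illustration,
not something this thread specifies.  Given the child docids that matched
an inner query, it returns the parents of those children.

```java
import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of looking up parents inside contiguous doc blocks.
final class BlockJoinSketch {

    // Assumed layout: the children of a block come first and the parent is
    // the last doc in the block, so the parent of a child is the first
    // parent bit at or after the child's docid.
    static int parentOf(BitSet parentBits, int childDoc) {
        return parentBits.nextSetBit(childDoc);
    }

    // Maps matching child docids to the distinct parents that contain them,
    // in the order the parents are first seen.
    static Set<Integer> parentsOf(BitSet parentBits, int[] matchingChildren) {
        Set<Integer> parents = new LinkedHashSet<>();
        for (int child : matchingChildren) {
            int parent = parentOf(parentBits, child);
            if (parent >= 0) {  // -1 means no parent follows this docid
                parents.add(parent);
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        // Block 1: pages 0-2 belong to book 3.  Block 2: pages 4-5 belong to book 6.
        BitSet parentBits = new BitSet();
        parentBits.set(3);
        parentBits.set(6);

        // Pages 1, 2 and 5 matched the inner query.
        System.out.println(parentsOf(parentBits, new int[] {1, 2, 5}));
    }
}
```

The lookup relies purely on docid arithmetic, which is why every doc of a
block has to stay in one contiguous range: changing a single page means
rewriting the whole block so the range stays intact.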

as for the name ... i understand why it's called module/join and i 
understand why the classes are called BlockJoinQuery and 
BlockJoinCollector but i don't think those names really stand out and 
convey to end users what they do and how/why they are useful.

Personally i think better names would be modules/subdocuments, 
ParentDocumentQuery and ChildDocumentsCollector

I know McCandless isn't a fan of the name "Nested Documents" because this
functionality *can* be used for use cases where the data being modeled is
not strictly organized in a nested relationship, but that doesn't mean
it's *optimal* or easy for a user to apply to other use cases, because they
have to design their model (and their indexing strategy) in such a way
that they think of them as nested or hierarchical documents.

Naming it module/subdocuments would not only emphasize the use case where
it really shines, it would more importantly draw attention to how users
have to model their data in order to take advantage of it -- and using
ParentDocument and ChildDocuments in the names of the Query/Collector
would make it clear what they match on relative to the underlying query
that they wrap/collect

it would also help distinguish from more general joins like what solr
does today -- it seems like that should eventually take the name
module/join

At a minimum we should rename what we have now modules/block-join or
modules/index-join (but the latter is confusing) and eventually add
modules/query-join  (yes, yes, block joins provide a query, but the
difference is when you have to make a decision about how you want to
join your model: at index time or at query time).


-Hoss




Re: revisit naming for grouping/join?

2011-07-04 Thread Michael McCandless
OK I'm sold!

I agree: let's rename this new module according to the most likely use
case, not according to its logical function, and I agree nested
documents is the compelling use case here.  Then fully generic joins
can go to a new module/join.

Maybe modules/nesteddocuments (I think that's more descriptive than
subdocuments)?

How about NestedDocumentQuery?  And NestedDocumentCollector?

See, you can use NestedDocumentQuery but collect it with any ordinary
collector if you don't care about the nesting (ie, you are only
interested in matches in the parent document space).  The
NestedDocumentCollector also collects all the nested docs matching
each parent hit.

You can of course still use this Query/Collector for any kind of
join, as long as your app is able to do this join at indexing time
and index all joined docs to a single row of the primary table as a
doc block.  But this will presumably be a less common use case so
I agree we should just name this feature according to its common use
case.

Mike McCandless

http://blog.mikemccandless.com


Re: revisit naming for grouping/join?

2011-07-04 Thread Chris Hostetter

: Maybe modules/nesteddocuments (I think that's more descriptive than
: subdocuments)?

either way ... subdocuments has the advantage of being a shorter directory 
name.  

i kinda wonder about first impressions and the etymology of "nested" ...
it makes me think of bird nests and Russian dolls, neither of which
really convey the point: nesting in birds is about protecting/incubating
and is only a single layer; while Russian nesting dolls are singular
wrappers around wrappers around wrappers.

subdocuments seems like it might be better because it conveys more of a
hierarchical nature (to me anyway).

: How about NestedDocumentQuery?  And NestedDocumentCollector?
: 
: See, you can use NestedDocumentQuery but collect it with any ordinary
: collector if you don't care about the nesting (ie, you are only
: interested in matches in the parent document space).  The
: NestedDocumentCollector also collects all the nested docs matching
: each parent hit.

Hmmm... 

My suggestion of ParentDocumentQuery was based on the understanding that 
the simplest usecase was...

  Query inner = getSomethingThatMatchesSomeChildDocs();
  Filter parents = someFilterThatMatchesAllKnownParentDocs();
  Query outer = new ParentDocumentQuery(inner, parents);
  TopDocs results = searcher.search(outer);

...and in this case results will contain the parents of the child 
documents that match inner.  is that correct?

if so, then independent of the Collector, ParentDocumentQuery (or
ParentDocumentQueryWrapper) still seems like it makes the most sense.

For the Collector, i realize now that i totally misunderstood its API --
for some reason i thought it would wrap another Collector and proxy to the
inner collector only the parents, independently collecting/recording the
groups of parent-children info which could be asked for later.

ChildDocumentsCollector definitely doesn't make sense -- it's not
just collecting children, it's collecting Groups made up of parents
and children ... GroupCollector is obviously too general though ... i
would toss out ParentChildrenTopGroupCollector to make it clear that:
  a) what you can get out of it are instances of TopGroups
  b) the Groups consist of Parents and Children

...but that may be trying to convey too much in a classname.

I certainly wouldn't complain about NestedDocumentCollector or
SubDocumentCollector if people like those better.


-Hoss




Re: revisit naming for grouping/join?

2011-07-03 Thread Michael McCandless
On Fri, Jul 1, 2011 at 10:02 AM, Robert Muir rcm...@gmail.com wrote:
 On Fri, Jul 1, 2011 at 8:51 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 The join module does currently depend on the grouping module, but for
 a silly reason: just for the TopGroups, to represent the returned
 hits.  We could move TopGroups/GroupDocs into common (thus justifying
 its generic name!)?  Then both join and grouping modules depend on
 common.

 Just a suggestion: maybe they belong in the lucene core? And maybe the
 stuff in the common module belongs in lucene core's util package?

+1

If we can generalize TopDocs so that we parameterize the type of each
hit, ie it could be a leaf (single doc + score (ScoreDoc) + maybe
field values (FieldDoc)) or another TopDocs, then we don't need the
separate TopGroups anymore.

 I guess I'm suggesting we try to keep our modules as flat as possible,
 with as few dependencies as possible. I think we really already
 have a 'common' module, that's the lucene core. If multiple modules end
 up relying upon the same functionality, especially if it's something
 simple like an abstract class (Analyzer) or a utility thing (these
 mutable integers, etc), then that's a good sign it belongs in core
 apis.

I like this approach.

 I think we really need to try to nuke all these dependencies between
 modules: it's great to add them as a way to get refactoring started,
 but ultimately we should try to clean up: because we don't want a
 complex 'graph' of dependencies but instead something dead-simple. I
 made a total mess with the analyzers module at first, i think
 everything depended on it! but now we have nuked almost all
 dependencies on this thing, except for where it makes sense to have
 that concrete dependency (benchmark, demo, solr).

Good!

 I think what would be best is a smallish but feature complete demo, ie
 pull together some easy-to-understand sample content and then build a
 small demo app around it.  We could then show how to use grouping for
 field collapsing (and for other use cases), joining for nested docs
 (and for other use cases), etc.


 For the same reason listed above, I think we should take our
 contrib/demo and consolidate 'examples' across various places into
 this demo module. The reason is:
 * examples typically depend upon 'concrete' stuff, but in general core
 stuff should work around interfaces/abstract classes: e.g. the
 faceting module has an analyzers dependency only because of its
 examples.
 * examples might want to integrate modules, e.g. an example of how to
 integrate faceting and grouping or something like that.
 * examples are important: i think if the same question comes up on the
 user list often, we should consider adding an example.

+1

I think especially now that we have very new interesting modules
(facet, join, grouping), we really need corresponding examples to
showcase all of this.

Mike




Re: revisit naming for grouping/join?

2011-07-03 Thread Michael McCandless
On Fri, Jul 1, 2011 at 9:28 AM, mark harwood markharw...@yahoo.co.uk wrote:
 I think what would be best is a smallish but feature complete demo,

 For the nested stuff I had a reasonable demo on LUCENE-2454 that was based
 around resumes - that use case has the one-to-many characteristics that lend
 themselves to nested e.g. a person has many different qualifications and
 records of employment.
 This scenario was illustrated here:
 http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

 I also had the book search type scenario where a book has many sections and
 for the purposes of efficient highlighting/summarisation  these sections were
 treated as child docs which could be read quickly (rather than highlighting a
 whole book)

I think both resumes and book search, and also others like the
variants of a product SKU, would all make good examples for the nested
docs use case.

 I'm not sure what the parent was in your doctor and cities example, Mike.
 If a doctor is in only one city then there is no point making city a child
 doc as the one city info can happily be combined with the doctor info into
 a single document with no conflict (doctors have different properties to
 cities).
 If the city is the parent with many child doctor docs that makes more sense
 but feels like a less likely use case e.g. "find me a city with doctor x and
 a different doctor y".
 Searching for a person with excellent java and preferably good lucene skills
 feels like a more real-world example.

In my example the city was parent -- I raised this example to explain
that index-time joining is more general than just nested docs (ie, I
think we should keep the name join for this module... also because
we should factor out more general search-time-only join capabilities
into it).

 It feels like documenting some of the trade-offs behind index design choices
 is useful too e.g. nesting is not too great for very volatile content with
 constantly changing children while search-time join is more costly in RAM and
 2-pass processing

+1, especially once we've factored out generic joins.

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: revisit naming for grouping/join?

2011-07-01 Thread Michael McCandless
I think joining and grouping are two different functions, and we
should keep different modules for them...

On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir rcm...@gmail.com wrote:
 Hi,

 when looking at just a very quick glance at some of the newer
 grouping/join features, I found myself a little confused about what is
 exactly what, and I think users might too.

They are confusing!

 I discussed some of this with hossman, and it only seemed to make me
 even more totally confused about:
 * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing
undersells (it's only one specific way to use grouping).  EG, grouping
w/o collapsing is useful (eg, Best Buy grouping hits by product
category and showing the top 5 in each).

 * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can
join (either during indexing or during searching) in many different
ways, and nested docs is just one use case.

EG, maybe your docs are doctors but during indexing you join to a city
table with facts about that city (each doctor's office is in a
specific city) and then you want to run queries like city's avg
annual temp  60 and doctor has good bedside manner or something.

 * difference between index-time-join/nested documents and single-pass
 index-time grouping. Is the former only a more general case of the
 latter?

Grouping is purely a presentation concern -- you are not altering
which docs hit; you are simply changing how you pick which hits to
display (top N by group).  So we only have collectors here.

The generic (requires 2 passes) collectors can group on anything at
search time; the doc block collector requires that you indexed all
docs in each group as a block.
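
To make the "presentation concern" point concrete, here is a rough sketch in
plain Java of picking the top N hits per group from an already-scored hit
list. The names (GroupingSketch, Hit, topNPerGroup) are made up for
illustration and are not the grouping module's API; it also assumes the hits
arrive sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these types are Lucene classes.
class GroupingSketch {

    /** A search hit: doc id, relevance score, and the value of the field we group on. */
    static final class Hit {
        final int doc;
        final float score;
        final String group;

        Hit(int doc, float score, String group) {
            this.doc = doc;
            this.score = score;
            this.group = group;
        }
    }

    /**
     * Keeps at most n hits per group. The input must already be sorted by
     * descending score, so the first n hits seen for a group are its best n.
     * The query's match set is untouched; only what gets displayed changes.
     */
    static Map<String, List<Hit>> topNPerGroup(List<Hit> sortedHits, int n) {
        Map<String, List<Hit>> groups = new LinkedHashMap<String, List<Hit>>();
        for (Hit hit : sortedHits) {
            List<Hit> bucket = groups.get(hit.group);
            if (bucket == null) {
                bucket = new ArrayList<Hit>();
                groups.put(hit.group, bucket);
            }
            if (bucket.size() < n) {
                bucket.add(hit);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit(7, 3.0f, "tv"), new Hit(2, 2.5f, "camera"),
            new Hit(9, 2.0f, "tv"), new Hit(4, 1.0f, "tv"));
        for (Map.Entry<String, List<Hit>> e : topNPerGroup(hits, 2).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " shown");
        }
    }
}
```

Groups come back in the order of their best hit, which is roughly what a
"top 5 per product category" display would want.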

Join is both about restricting matches and also presentation of hits,
because your query needs to match fields from different [logical]
tables (so, the module has a Query and a Collector).  When you get the
results back, you may or may not be interested in retaining the table
structure in your result set (ie, you may not have selected fields
from the child table).

Similarly, generic joining (in Solr/ElasticSearch today but I'd like
to factor into the join module) can do any join at search time, while
the doc block collector requires that you did the necessary join(s)
during indexing.
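
As a rough sketch of why the doc block approach needs the join done at
indexing time: if each city and its doctors are indexed as one contiguous
block, a matching child can be mapped to its parent with a bitset lookup and
no second pass. The code below is plain Java with made-up names
(BlockJoinSketch, parentsOfMatchingChildren), and it assumes the parent is
the last doc in each block; it is not the join module's code.

```java
import java.util.BitSet;

// Hypothetical illustration only; this is not the join module's code.
class BlockJoinSketch {

    /**
     * Maps matching child docs to their parents. Assumes each block was
     * indexed as consecutive doc ids with the parent last, so a child's
     * parent is the next set bit in parentBits at or after the child's id.
     */
    static BitSet parentsOfMatchingChildren(BitSet childMatches, BitSet parentBits) {
        BitSet parents = new BitSet();
        for (int child = childMatches.nextSetBit(0); child >= 0;
                child = childMatches.nextSetBit(child + 1)) {
            int parent = parentBits.nextSetBit(child);
            if (parent >= 0) {
                parents.set(parent);
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        // Blocks: [0,1 | 2], [3 | 4], [5,6,7 | 8]; docs 2, 4 and 8 are cities.
        BitSet cities = new BitSet();
        cities.set(2);
        cities.set(4);
        cities.set(8);

        // Doctors matching the child-side query.
        BitSet doctors = new BitSet();
        doctors.set(1);
        doctors.set(6);
        doctors.set(7);

        System.out.println(parentsOfMatchingChildren(doctors, cities));
    }
}
```

The catch is the one already mentioned: the block layout is fixed when you
index, so you have to know up front which joins you want.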

 * difference between the above joinish capabilities and solr's join
 impl... other than the single-pass/index-time limitation (which is
 really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join
anything at search time (even across 2 different indexes), vs doc
block join where you must pick which joins you will ever want to use
and then build the index accordingly.
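
For contrast, a generic search-time join can be thought of as two passes:
first collect the join keys from the docs matching one side, then keep the
docs on the other side whose key is in that set. The sketch below shows only
the second pass, in plain Java with made-up names (SearchTimeJoinSketch,
Doctor, doctorsIn); it is an assumption about the general shape, not Solr's
or ElasticSearch's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Solr's or ElasticSearch's implementation.
class SearchTimeJoinSketch {

    /** A doctor doc carrying the join key (the city it practises in). */
    static final class Doctor {
        final String name;
        final String city;

        Doctor(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    /**
     * Pass 2 of a two-pass join. matchingCities is the output of pass 1
     * (the join keys of the city docs that matched the city-side query);
     * here we keep only the doctors whose key appears in that set.
     */
    static List<String> doctorsIn(Set<String> matchingCities, List<Doctor> doctors) {
        List<String> names = new ArrayList<String>();
        for (Doctor d : doctors) {
            if (matchingCities.contains(d.city)) {
                names.add(d.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Pretend pass 1 already ran the city-side query and produced these keys.
        Set<String> matchingCities = new HashSet<String>(Arrays.asList("Austin", "Miami"));
        List<Doctor> doctors = Arrays.asList(
            new Doctor("Dr. A", "Austin"),
            new Doctor("Dr. B", "Boston"),
            new Doctor("Dr. C", "Miami"));
        System.out.println(doctorsIn(matchingCities, doctors));
    }
}
```

Nothing here depends on how the docs were indexed, which is what makes it
general, but the key set from pass 1 has to be held in memory.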

You can also mix the two.  Maybe you do certain joins while indexing,
but then at search time you do other joins generically.  That's
fine.  (Same is true for grouping).

 I think it's especially interesting since the join module depends on
 the grouping module.

The join module does currently depend on the grouping module, but for
a silly reason: just for the TopGroups, to represent the returned
hits.  We could move TopGroups/GroupDocs into common (thus justifying
its generic name!)?  Then both join and grouping modules depend on
common.

Really TopGroups is just a TopDocs that allows some recursion (ie,
each hit may in turn be another TopDocs).  But TopGroups is limited
now to only depth 2 recursion... we need to fix this for nested
grouping.  Really we just need a recursive TopDocs here
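
Roughly what I mean by a recursive TopDocs, as a plain Java sketch: each hit
may itself hold a list of hits, to any depth. NestedHits is a made-up name
for illustration, not an existing or proposed Lucene class.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: NestedHits is not a Lucene class.
class NestedHits {
    final String label;              // e.g. a group value, or a doc id for a leaf
    final List<NestedHits> children; // empty for a leaf hit

    NestedHits(String label, List<NestedHits> children) {
        this.label = label;
        this.children = children;
    }

    /** A plain hit with nothing nested under it. */
    static NestedHits leaf(String label) {
        return new NestedHits(label, Collections.<NestedHits>emptyList());
    }

    /** Depth of nesting below this node: 0 for a leaf, otherwise 1 + deepest child. */
    int depth() {
        int deepest = -1;
        for (NestedHits child : children) {
            deepest = Math.max(deepest, child.depth());
        }
        return deepest + 1;
    }

    public static void main(String[] args) {
        // category -> brand -> doc: one more level than a flat list of groups.
        NestedHits doc = leaf("doc 12");
        NestedHits brand = new NestedHits("brand=acme", Collections.singletonList(doc));
        NestedHits category = new NestedHits("category=tv", Collections.singletonList(brand));
        System.out.println(category.depth());
    }
}
```

With one type for every level there is no built-in limit on nesting, which
is what nested grouping would need.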

 So I am curious if we should:
 * add docs (maybe with simple examples) in the package.html or
 otherwise that differentiate what these guys are, or at least agree on
 some consistent terminology and define it somewhere? I feel like
 people have explained to me the differences in all these things
 before, but then it's easy to forget.

Well, each module's package.html has a start here, but I agree we
should do more.

I think what would be best is a smallish but feature complete demo, ie
pull together some easy-to-understand sample content and then build a
small demo app around it.  We could then show how to use grouping for
field collapsing (and for other use cases), joining for nested docs
(and for other use cases), etc.

Mike McCandless

http://blog.mikemccandless.com




Re: revisit naming for grouping/join?

2011-07-01 Thread mark harwood
 I think what would be best is a smallish but feature complete demo,

For the nested stuff I had a reasonable demo on LUCENE-2454 that was based
around resumes - that use case has the one-to-many characteristics that lend
themselves to nesting, e.g. a person has many different qualifications and
records of employment.
This scenario was illustrated here:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

I also had the book search type scenario where a book has many sections and 
for the purposes of efficient highlighting/summarisation  these sections were 
treated as child docs which could be read quickly (rather than highlighting a 
whole book)

I'm not sure what the parent was in your doctor and cities example, Mike.
If a doctor is in only one city then there is no point making city a child
doc, as the one city's info can happily be combined with the doctor info
into a single document with no conflict (doctors have different properties
to cities).
If the city is the parent with many child doctor docs that makes more sense
but feels like a less likely use case, e.g. find me a city with doctor x
and a different doctor y.
Searching for a person with excellent java and preferably good lucene skills
feels like a more real-world example.

It feels like documenting some of the trade-offs behind index design choices
is useful too, e.g. nesting is not too great for very volatile content with
constantly changing children, while search-time join is more costly in RAM
and needs 2-pass processing.

Cheers
Mark




Re: revisit naming for grouping/join?

2011-07-01 Thread Robert Muir
On Fri, Jul 1, 2011 at 8:51 AM, Michael McCandless
luc...@mikemccandless.com wrote:

 The join module does currently depend on the grouping module, but for
 a silly reason: just for the TopGroups, to represent the returned
 hits.  We could move TopGroups/GroupDocs into common (thus justifying
 its generic name!)?  Then both join and grouping modules depend on
 common.

Just a suggestion: maybe they belong in the lucene core? And maybe the
stuff in the common module belongs in lucene core's util package?

I guess I'm suggesting we try to keep our modules as flat as possible,
with as few dependencies as possible. I think we really already
have a 'common' module: that's the lucene core. If multiple modules end
up relying upon the same functionality, especially if it's something
simple like an abstract class (Analyzer) or a utility thing (these
mutable integers, etc), then that's a good sign it belongs in the core
APIs.

I think we really need to try to nuke all these dependencies between
modules: it's great to add them as a way to get refactoring started,
but ultimately we should try to clean up, because we don't want a
complex 'graph' of dependencies but instead something dead-simple. I
made a total mess with the analyzers module at first; I think
everything depended on it! But now we have nuked almost all
dependencies on this thing, except for where it makes sense to have
that concrete dependency (benchmark, demo, solr).


 I think what would be best is a smallish but feature complete demo, ie
 pull together some easy-to-understand sample content and then build a
 small demo app around it.  We could then show how to use grouping for
 field collapsing (and for other use cases), joining for nested docs
 (and for other use cases), etc.


For the same reason listed above, I think we should take our
contrib/demo and consolidate 'examples' across various places into
this demo module. The reason is:
* examples typically depend upon 'concrete' stuff, but in general core
stuff should work around interfaces/abstract classes: e.g. the
faceting module has an analyzers dependency only because of its
examples.
* examples might want to integrate modules, e.g. an example of how to
integrate faceting and grouping or something like that.
* examples are important: I think if the same question comes up on the
user list often, we should consider adding an example.




RE: revisit naming for grouping/join?

2011-07-01 Thread Steven A Rowe
On 7/1/2011 at 10:02 AM, Robert Muir wrote:
 [...] I think we should take our contrib/demo and consolidate 'examples'
 across various places into this demo module. The reason is:

 * examples typically depend upon 'concrete' stuff, but in general core
   stuff should work around interfaces/abstract classes: e.g. the faceting
   module has an analyzers dependency only because of its examples.

 * examples might want to integrate modules, e.g. an example of how to
   integrate faceting and grouping or something like that.

 * examples are important: I think if the same question comes up on the
   user list often, we should consider adding an example.

+1


revisit naming for grouping/join?

2011-06-30 Thread Robert Muir
Hi,

when looking at just a very quick glance at some of the newer
grouping/join features, I found myself a little confused about what is
exactly what, and I think users might too.

I discussed some of this with hossman, and it only seemed to make me
even more totally confused about:
* difference between field collapsing and grouping
* difference between nested documents and the index-time join
* difference between index-time-join/nested documents and single-pass
index-time grouping. Is the former only a more general case of the
latter?
* difference between the above joinish capabilities and solr's join
impl... other than the single-pass/index-time limitation (which is
really an implementation detail), I'm talking about use cases.

I think it's especially interesting since the join module depends on
the grouping module.

So I am curious if we should:
* add docs (maybe with simple examples) in the package.html or
otherwise that differentiate what these guys are, or at least agree on
some consistent terminology and define it somewhere? I feel like
people have explained to me the differences in all these things
before, but then it's easy to forget.
* should we rename the join module to nested? or combine it with
grouping as a subdocument module? or something else?




Re: revisit naming for grouping/join?

2011-06-30 Thread Chris Male
Hi,

On Fri, Jul 1, 2011 at 2:30 PM, Robert Muir rcm...@gmail.com wrote:

 [...]
 So I am curious if we should:
 [...]
 * should we rename the join module to nested? or combine it with
 grouping as a subdocument module? or something else?


What about, dare I say, a document-relation module? Joining documents across
queries, in docblocks, into a group, or as part of a nest, all seem to be
about relationships between documents.

With all of these concepts in a single module, we can then definitely work
on the package.htmls to make the purpose of each part clear (and maybe
list other known phrases for the same concepts, such as field collapsing and
grouping).







-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl