[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057148#comment-13057148 ]

Toke Eskildsen edited comment on LUCENE-3079 at 6/29/11 11:02 AM:
------------------------------------------------------------------

I re-ran the edge case of 5M documents, 6.8 paths/doc, max depth 6 (22M unique 
paths), drilling down to level 6 (d6) and to level 1 (d1). I added a test using 
the path 'root/a/b' as the starting point for the facet request. I also verified 
that indexing does indeed run with 48 MB for LUCENE-2369, but again this is just 
plain Lucene indexing.

||                             || LUCENE-3079 (d6) || LUCENE-3079 (d1) || LUCENE-3079 (d1+'deep/a/b') || LUCENE-2369 (d6) ||
| Index build time             |  752 s  |  771 s  |    -    |   347 s |
| Memory required for indexing | 2500 MB | 2500 MB | 2500 MB |   48 MB |
| First facet request          | 3840 ms | 1929 ms | 1963 ms | 147,000 ms |
| Best of 5 requests           | 2688 ms | 1172 ms | 1246 ms |  2673 ms |
| Memory usage after faceting  |  435 MB |  435 MB |  435 MB |   294 MB |

Going to depth 1 helped a lot. I would have expected requesting from 
'deep/a/b' to be even faster, but I guess I'll have to dig into the code to 
understand why it was not.

bq. It actually speeds up counting overall. If you think about it, when we 
encounter category ordinals, we just increment the count by 1 in the respective 
location in the count array. No need to ask whether this is an ordinal the user 
asked to count at all. Later, when we compute the top-K, we know more 
efficiently which root ordinal the user requested to count, and its children, 
so it's just a matter of putting everything into a heap and returning the top-K.

LUCENE-2369 uses exactly the same counting strategy (brute counting of 
everything). Not having explicit duplicates speeds this phase up and saves 
memory, but the numbers for d6 and d1 very clearly show that it is faster 
overall to skip the drill-down in the extraction phase. At least for this test 
(and then we're back to creating a proper test suite).
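The two-phase scheme described in the quote above (brute-force increment into a flat count array, then a size-K min-heap over the requested root's children) can be sketched roughly like this. This is a hypothetical Java illustration, not the actual LUCENE-3079 or LUCENE-2369 code; the class, method, and the flat-ordinal-array representation are all invented for the example:

```java
import java.util.PriorityQueue;

// Illustrative sketch only: counts every ordinal unconditionally,
// then builds a top-K heap over the requested root's children.
public class FacetCountSketch {
    public static int[] topK(int[] ordinalStream, int numOrdinals,
                             int[] childrenOfRoot, int k) {
        // Phase 1: brute counting of everything. No per-ordinal check
        // of whether the user asked for it -- just increment the slot.
        int[] counts = new int[numOrdinals];
        for (int ord : ordinalStream) {
            counts[ord]++;
        }

        // Phase 2: min-heap of size k, restricted to the children of
        // the root ordinal the user requested to count.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> counts[a] - counts[b]);
        for (int child : childrenOfRoot) {
            heap.offer(child);
            if (heap.size() > k) {
                heap.poll(); // evict the ordinal with the smallest count
            }
        }

        // Drain the heap into descending-count order.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }
}
```

The point of the sketch is that phase 1 does no filtering at all, which is why duplicate ordinals in the stream (explicit or not) only cost extra increments there, while the selection work is deferred entirely to the heap in phase 2.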

Just for kicks, I tried guesstimating the layout of a corpus of addresses by 
upping the documents and lowering the paths: 100M documents, 0.84 paths/doc, max 
depth 4 (2.5M unique paths):
||                             || LUCENE-3079 (d4) || LUCENE-3079 (d1) || LUCENE-2369 (d6) ||
| Index build time             |   28 min |     -    |   17 min |
| Memory required for indexing |    ? MB  |    ? MB  |    ? MB  |
| First facet request          | 13933 ms | 13367 ms | 46,000 ms |
| Best of 5 requests           | 11718 ms | 11036 ms |  2989 ms |
| Memory usage after faceting  |   240 MB |   240 MB |   475 MB |

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, 
> LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]