[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Jiang updated LUCENE-3069:
------------------------------

    Attachment: example.png
                LUCENE-3069.patch

Uploaded patch; it is the main part of the changes I committed to branch3069.

The picture shows the current impl of outputs (fetched from one field in wikimedium5k):

* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)

A single flag byte is used to indicate whether/which fields the current outputs maintain; for a PBF with a short byte[], this should be enough. Also, for long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the body field in wikimedium1m, 85.8% of terms have df == ttf).

Since TermsEnum is entirely based on FSTEnum, the performance of the term dict should be similar to MemoryPF. However, for PK tasks we have to pull the docsEnum from MMap, so this hurts. Following is the performance comparison:

{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall

                Task  QPS base    StdDev  QPS comp    StdDev              Pct diff
             Respell     48.13    (4.4%)     15.38    (1.0%)   -68.0% ( -70% - -65%)
              Fuzzy2     51.30    (5.3%)     17.47    (1.3%)   -65.9% ( -68% - -62%)
              Fuzzy1     52.24    (4.0%)     18.50    (1.2%)   -64.6% ( -67% - -61%)
            Wildcard      9.31    (1.7%)      6.16    (2.2%)   -33.8% ( -37% - -30%)
             Prefix3     23.25    (1.8%)     19.00    (2.2%)   -18.3% ( -21% - -14%)
            PKLookup    244.92    (3.6%)    225.42    (2.3%)    -8.0% ( -13% - -2%)
             LowTerm    295.88    (5.5%)    293.27    (4.8%)    -0.9% ( -10% - 9%)
          HighPhrase     13.62    (6.5%)     13.54    (7.4%)    -0.6% ( -13% - 14%)
             MedTerm     99.51    (7.8%)     99.19    (7.7%)    -0.3% ( -14% - 16%)
           MedPhrase    154.63    (9.4%)    154.38   (10.1%)    -0.2% ( -17% - 21%)
            HighTerm     28.25   (10.7%)     28.25   (10.0%)    -0.0% ( -18% - 23%)
          OrHighHigh     16.83   (13.3%)     16.86   (13.1%)     0.2% ( -23% - 30%)
    HighSloppyPhrase      9.02    (4.4%)      9.03    (4.5%)     0.2% ( -8% - 9%)
           LowPhrase      6.26    (3.4%)      6.27    (4.1%)     0.2% ( -7% - 8%)
           OrHighMed     13.73   (13.2%)     13.77   (12.8%)     0.3% ( -22% - 30%)
           OrHighLow     25.65   (13.2%)     25.73   (13.0%)     0.3% ( -22% - 30%)
     MedSloppyPhrase      6.63    (2.7%)      6.66    (2.7%)     0.5% ( -4% - 6%)
          AndHighMed     42.77    (1.8%)     43.13    (1.5%)     0.8% ( -2% - 4%)
     LowSloppyPhrase     32.68    (3.0%)     32.96    (2.8%)     0.8% ( -4% - 6%)
         AndHighHigh     22.90    (1.2%)     23.18    (0.7%)     1.2% ( 0% - 3%)
         LowSpanNear     29.30    (2.0%)     29.83    (2.2%)     1.8% ( -2% - 6%)
         MedSpanNear      8.39    (2.7%)      8.56    (2.9%)     2.0% ( -3% - 7%)
              IntNRQ      3.12    (1.9%)      3.18    (6.7%)     2.1% ( -6% - 10%)
          AndHighLow    507.01    (2.4%)    522.10    (2.8%)     3.0% ( -2% - 8%)
        HighSpanNear      5.43    (1.8%)      5.60    (2.6%)     3.1% ( -1% - 7%)
{noformat}

{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall

                Task  QPS base    StdDev  QPS comp    StdDev              Pct diff
             Respell     49.24    (2.7%)     15.51    (1.0%)   -68.5% ( -70% - -66%)
              Fuzzy2     52.01    (4.8%)     17.61    (1.4%)   -66.1% ( -68% - -63%)
              Fuzzy1     53.00    (4.0%)     18.62    (1.3%)   -64.9% ( -67% - -62%)
            Wildcard      9.37    (1.3%)      6.15    (2.1%)   -34.4% ( -37% - -31%)
             Prefix3     23.36    (0.8%)     18.96    (2.1%)   -18.8% ( -21% - -16%)
           MedPhrase    155.86    (9.8%)    152.34    (9.7%)    -2.3% ( -19% - 19%)
           LowPhrase      6.33    (3.7%)      6.23    (4.0%)    -1.6% ( -8% - 6%)
          HighPhrase     13.68    (7.2%)     13.49    (6.8%)    -1.4% ( -14% - 13%)
           OrHighMed     13.78   (13.0%)     13.68   (12.7%)    -0.8% ( -23% - 28%)
    HighSloppyPhrase      9.14    (5.2%)      9.07    (3.7%)    -0.7% ( -9% - 8%)
          OrHighHigh     16.87   (13.3%)     16.76   (12.9%)    -0.6% ( -23% - 29%)
           OrHighLow     25.71   (13.1%)     25.58   (12.8%)    -0.5% ( -23% - 29%)
     MedSloppyPhrase      6.69    (2.7%)      6.67    (2.4%)    -0.3% ( -5% - 4%)
     LowSloppyPhrase     33.01    (3.2%)     32.99    (2.6%)    -0.1% ( -5% - 5%)
             MedTerm     99.64    (8.0%)     99.67   (10.9%)     0.0% ( -17% - 20%)
             LowTerm    294.52    (5.5%)    295.72    (7.2%)     0.4% ( -11% - 13%)
         LowSpanNear     29.61    (2.6%)     29.76    (2.7%)     0.5% ( -4% - 5%)
              IntNRQ      3.13    (1.8%)      3.16    (7.8%)     0.8% ( -8% - 10%)
         MedSpanNear      8.49    (3.0%)      8.57    (3.4%)     0.9% ( -5% - 7%)
          AndHighMed     42.86    (1.4%)     43.35    (1.4%)     1.1% ( -1% - 3%)
         AndHighHigh     22.98    (0.6%)     23.26    (0.5%)     1.2% ( 0% - 2%)
        HighSpanNear      5.51    (3.4%)      5.58    (3.4%)     1.3% ( -5% - 8%)
            HighTerm     28.32   (10.5%)     28.76   (15.0%)     1.6% ( -21% - 30%)
          AndHighLow    509.60    (2.2%)    526.17    (1.9%)     3.3% ( 0% - 7%)
            PKLookup    156.59    (2.2%)    225.47    (2.8%)    44.0% ( 38% - 50%)
{noformat}

To revive the performance on automaton queries, the intersect methods should be implemented.
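The flag-byte layout described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual patch code: the class name TermStatsEncoder and the flag constants are hypothetical, and the real outputs go through Lucene's DataOutput rather than a ByteArrayOutputStream.

```java
import java.io.ByteArrayOutputStream;

/**
 * Hypothetical sketch of the flag-byte scheme: a single byte records which
 * fields the current term's outputs carry, and when docFreq == totalTermFreq
 * the ttf is not written at all (it is "inlined" into df).
 */
class TermStatsEncoder {
  static final int HAS_LONGS = 1; // sortable long[] metadata present
  static final int HAS_BYTES = 2; // generic byte[] metadata present
  static final int DF_EQ_TTF = 4; // totalTermFreq == docFreq, so ttf is omitted

  static byte[] encode(int docFreq, long totalTermFreq, long[] longs, byte[] bytes) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int flags = 0;
    if (longs != null && longs.length > 0) flags |= HAS_LONGS;
    if (bytes != null && bytes.length > 0) flags |= HAS_BYTES;
    if (totalTermFreq == docFreq) flags |= DF_EQ_TTF;
    out.write(flags);
    writeVLong(out, docFreq);
    if ((flags & DF_EQ_TTF) == 0) {
      // store the delta, which stays small for most terms
      writeVLong(out, totalTermFreq - docFreq);
    }
    if ((flags & HAS_LONGS) != 0) {
      writeVLong(out, longs.length);
      for (long v : longs) writeVLong(out, v);
    }
    if ((flags & HAS_BYTES) != 0) {
      writeVLong(out, bytes.length);
      out.write(bytes, 0, bytes.length);
    }
    return out.toByteArray();
  }

  // standard variable-length encoding, 7 bits per byte, low bits first
  static void writeVLong(ByteArrayOutputStream out, long v) {
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7FL) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
  }
}
```

Under this layout a long-tail term with df == ttf and no metadata costs just the flag byte plus a vLong docFreq, which is why inlining pays off when 85.8% of terms qualify.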
And the index size comparison (actually, after LUCENE-5029, TempBlock has a slightly larger (~5%) index size than Lucene41):

{noformat}
              wikimedium1m    wikimediumall
Memory           2,212,352    /
Lucene41           448,164       12,104,520
TempFST            525,888       12,770,700
{noformat}

As for the term dict size:

{noformat}
                        wikimedium1m    wikimediumall
Lucene41 (.tim+.tip)          157776          2059744
TempFST (.tmp)                233636          2779784
TempFST overhead                 48%              35%
{noformat}

Some unresolved problems:

* Currently, TempFST uses the default option to build the FST (i.e. doPacked = false). When this option is switched on, the index size on wikimedium1m becomes smaller, but on wikimediumall it becomes larger. Why?

> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta.