> it'd be nice to genericize MultiLevelSkipListWriter so that it could index arbitrary files
+1 on this idea. Using skip lists for the term index would be an improvement.
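To make that concrete, here's a rough sketch of the sort of generic multi-level skip structure I have in mind for the terms dict index. The class and method names below are invented for illustration (this is not the existing MultiLevelSkipListWriter API); in a real implementation only the sparsest levels would be pinned in process RAM, with the denser levels left on disk for the OS's IO cache:

{code}
// Hypothetical sketch only -- not the actual MultiLevelSkipListWriter API.
// Builds a multi-level skip index over a sorted stream of (term, filePointer)
// entries: level 0 records every 16th term, level 1 every 256th, and so on.
import java.util.ArrayList;
import java.util.List;

class GenericTermSkipIndex {
  static final int SKIP_INTERVAL = 16;  // fan-out between levels
  static final int MAX_LEVELS = 4;      // bound on levels (term 0 divides every interval)

  static class SkipEntry {
    final String term;       // first term at this skip point
    final long filePointer;  // where that term starts in the terms file
    SkipEntry(String term, long fp) { this.term = term; this.filePointer = fp; }
  }

  // levels.get(0) is the densest level, levels.get(levels.size()-1) the sparsest.
  // A real implementation would keep only the sparsest levels in RAM and read
  // the denser ones from disk through the OS IO cache.
  private final List<List<SkipEntry>> levels = new ArrayList<>();

  // Called for each term, in order, while the terms dict is being written.
  void onTerm(String term, long filePointer, int termIndex) {
    long interval = SKIP_INTERVAL;
    for (int level = 0; level < MAX_LEVELS && termIndex % interval == 0; level++) {
      while (levels.size() <= level) levels.add(new ArrayList<SkipEntry>());
      levels.get(level).add(new SkipEntry(term, filePointer));
      interval *= SKIP_INTERVAL;
    }
  }

  // Returns a file pointer at or before the target term; the caller then
  // scans forward in the terms file from that point.
  long seek(String target) {
    if (levels.isEmpty()) return 0;
    int pos = 0;  // index into the current level
    long fp = 0;
    for (int level = levels.size() - 1; level >= 0; level--) {
      List<SkipEntry> lvl = levels.get(level);
      // advance within this level while the next skip point is still <= target
      while (pos + 1 < lvl.size() && lvl.get(pos + 1).term.compareTo(target) <= 0) {
        pos++;
      }
      fp = lvl.get(pos).filePointer;
      pos *= SKIP_INTERVAL;  // same skip point, one level denser
    }
    return fp;
  }
}
{code}

The point of the split is that the sparse levels stay tiny and are cheap to keep in RAM, while the dense level 0 is the part we'd tentatively trust to the IO cache.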
On Tue, Nov 18, 2008 at 12:27 PM, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648739#action_12648739 ]
>
> Michael McCandless commented on LUCENE-1458:
> --------------------------------------------
>
> bq. Can we design a format that allows us to rely upon the operating system's
> virtual memory and avoid caching in process memory altogether?
>
> Interesting! I've been wondering what you're up to over on KS, Marvin :)
>
> I'm not sure it'll be a win in practice: I'm not sure I'd trust the
> OS's IO cache to "make the right decisions" about what to cache. Plus
> during that binary search the IO system is loading whole pages into
> the IO cache, even though you'll only peek at the first few bytes of
> each.
>
> We could also explore something in-between, eg it'd be nice to
> genericize MultiLevelSkipListWriter so that it could index arbitrary
> files, then we could use that to index the terms dict. You could
> choose to spend dedicated process RAM on the higher levels of the skip
> tree, and then tentatively trust IO cache for the lower levels.
>
> I'd like to eventually make the TermsDict index pluggable so one could
> swap in different indexers like this (it's not now).
>
>
> > Further steps towards flexible indexing
> > ---------------------------------------
> >
> >                 Key: LUCENE-1458
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Index
> >    Affects Versions: 2.9
> >            Reporter: Michael McCandless
> >            Assignee: Michael McCandless
> >            Priority: Minor
> >             Fix For: 2.9
> >
> >         Attachments: LUCENE-1458.patch, LUCENE-1458.patch
> >
> >
> > I attached a very rough checkpoint of my current patch, to get early
> > feedback. All tests pass, though back compat tests don't pass due to
> > changes to package-private APIs plus certain bugs in tests that
> > happened to work (eg call TermPositions.nextPosition() too many times,
> > which the new API asserts against).
> > [Aside: I think, when we commit changes to package-private APIs such
> > that back-compat tests don't pass, we could go back, make a branch on
> > the back-compat tag, commit changes to the tests to use the new
> > package private APIs on that branch, then fix nightly build to use the
> > tip of that branch?]
> > There's still plenty to do before this is committable! This is a
> > rather large change:
> > * Switches to a new more efficient terms dict format. This still
> > uses tii/tis files, but the tii only stores term & long offset
> > (not a TermInfo). At seek points, tis encodes term & freq/prox
> > offsets absolutely instead of with deltas. Also, tis/tii
> > are structured by field, so we don't have to record field number
> > in every term.
> > .
> > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> > .
> > RAM usage when loading terms dict index is significantly less
> > since we only load an array of offsets and an array of String (no
> > more TermInfo array). It should be faster to init too.
> > .
> > This part is basically done.
> > * Introduces modular reader codec that strongly decouples terms dict
> > from docs/positions readers. EG there is no more TermInfo used
> > when reading the new format.
> > .
> > There's nice symmetry now between reading & writing in the codec
> > chain -- the current docs/prox format is captured in:
> > {code}
> > FormatPostingsTermsDictWriter/Reader
> > FormatPostingsDocsWriter/Reader (.frq file) and
> > FormatPostingsPositionsWriter/Reader (.prx file).
> > {code}
> > This part is basically done.
> > * Introduces a new "flex" API for iterating through the fields,
> > terms, docs and positions:
> > {code}
> > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> > {code}
> > This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> > old API on top of the new API to keep back-compat.
> >
> > Next steps:
> > * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> > fix any hidden assumptions.
> > * Expose new API out of IndexReader, deprecate old API but emulate
> > old API on top of new one, switch all core/contrib users to the
> > new API.
> > * Maybe switch to AttributeSources as the base class for TermsEnum,
> > DocsEnum, PostingsEnum -- this would give readers API flexibility
> > (not just index-file-format flexibility). EG if someone wanted
> > to store payload at the term-doc level instead of
> > term-doc-position level, you could just add a new attribute.
> > * Test performance & iterate.
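Also, for my own understanding of the flex API part: below is roughly how I picture a consumer walking the new enum chain once it's exposed from IndexReader. All of the interfaces and method names here are guesses on my part (the patch is only a rough checkpoint), purely to illustrate the fields -> terms -> docs -> positions nesting:

{code}
// Purely hypothetical interfaces -- NOT the ones in the patch -- just a guess
// at the shape of the FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum chain.
import java.io.IOException;

interface FieldProducer {
  TermsEnum terms(String field) throws IOException;   // terms of one field
}

interface TermsEnum {
  String next() throws IOException;                   // null when exhausted
  DocsEnum docs() throws IOException;                  // docs for the current term
}

interface DocsEnum {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int nextDoc() throws IOException;                    // NO_MORE_DOCS when done
  int freq() throws IOException;                       // term freq in the current doc
  PostingsEnum positions() throws IOException;         // positions in the current doc
}

interface PostingsEnum {
  int nextPosition() throws IOException;               // next position in the doc
}

class FlexApiSketch {
  // Walks one field: every term, every doc of that term, every position.
  static void walkField(FieldProducer fields, String field) throws IOException {
    TermsEnum terms = fields.terms(field);
    while (terms.next() != null) {
      DocsEnum docs = terms.docs();
      while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) {
        PostingsEnum postings = docs.positions();
        for (int i = 0; i < docs.freq(); i++) {
          int position = postings.nextPosition();
          // with AttributeSource-based enums, a codec could expose extra
          // per-doc data (eg a term-doc-level payload) as an attribute here
        }
      }
    }
  }
}
{code}

If the enums do end up extending AttributeSource, that inner loop is where a term-doc-level payload attribute would naturally show up.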