It's not that it isn't required -- it's just that it stores less info
than before.
I changed the _X.tis format such that at each seekable point (every 128
terms by default), everything is written as absolutes (term text, freq
& prox offset). This means the _X.tii file only has to store the
indexed term & offset into the _X.tis file.
Then all we need to load into RAM are two column-stride arrays: the
long offset (into the _X.tis file) and the terms. Also, in RAM I
store the terms as String[] within a per-field class, instead of
Term[], which saves the object & 2 pointer overhead.
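As a rough illustration of the in-RAM layout described above, here is a minimal sketch of a per-field index holding two parallel arrays. The class name, field names and floorSlot method are hypothetical and are not the classes in the patch; it also assumes terms sort by String.compareTo, which may differ from the real term order.

```java
import java.util.Arrays;

/** Hypothetical per-field terms index: two parallel ("column-stride") arrays. */
class FieldTermsIndexSketch {
    final String field;          // stored once per field, not once per term
    final String[] indexTerms;   // text of every Nth term, in sorted order
    final long[] tisOffsets;     // tisOffsets[i] = file pointer in _X.tis for indexTerms[i]

    FieldTermsIndexSketch(String field, String[] indexTerms, long[] tisOffsets) {
        if (indexTerms.length != tisOffsets.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        this.field = field;
        this.indexTerms = indexTerms;
        this.tisOffsets = tisOffsets;
    }

    /**
     * Returns the slot of the greatest indexed term <= target,
     * or -1 if target sorts before every indexed term.
     */
    int floorSlot(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        // A miss returns -(insertionPoint) - 1; the floor is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

The point of the layout is that each indexed term costs one String reference and one long, with no per-term wrapper object carrying its own field reference.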
It's similar to how video muxers store their index into key frames,
where a key frame is an "absolute" frame that can be decoded without
seeing prior frames.
I think RAM savings should be at least 50% for "typical" terms (avg 10
chars say). Longer avg term length will see less savings. But, these
savings apply only to the term index, so if your tii file is smallish,
net/net it won't reduce overall RAM usage that much.
When seeking is done, we look in the index to find the nearest spot in
_X.tis before the term we are looking for, jump there, read the
absolutes for that next() term, and then read deltas to continue
scanning.
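The seek steps above can be sketched roughly as follows. This is a toy in-memory model, not the code in the patch: the class and member names are made up, the index interval is 4 rather than 128 so the demo stays small, the prox offset is left out, and "delta" term text is assumed to mean a shared-prefix length plus a suffix.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a terms dict with absolute entries at seek points and deltas between. */
class TermsDictSeekSketch {
    static final int INDEX_INTERVAL = 4; // assumption for the demo; 128 by default in the format

    /** One entry as it might sit in the tis stream. */
    static final class Entry {
        final int sharedPrefix; // 0 at seek points: term text is written in full
        final String suffix;
        final long freqValue;   // absolute offset at seek points, delta otherwise
        Entry(int sharedPrefix, String suffix, long freqValue) {
            this.sharedPrefix = sharedPrefix;
            this.suffix = suffix;
            this.freqValue = freqValue;
        }
    }

    final List<Entry> tis = new ArrayList<>();           // stands in for _X.tis
    final List<String> indexTerms = new ArrayList<>();   // stands in for _X.tii terms
    final List<Integer> indexSlots = new ArrayList<>();  // stands in for long file offsets

    TermsDictSeekSketch(String[] sortedTerms, long[] freqOffsets) {
        String prev = "";
        long prevFreq = 0;
        for (int i = 0; i < sortedTerms.length; i++) {
            String term = sortedTerms[i];
            if (i % INDEX_INTERVAL == 0) {
                // Seek point: record it in the index and write absolutes.
                indexTerms.add(term);
                indexSlots.add(tis.size());
                tis.add(new Entry(0, term, freqOffsets[i]));
            } else {
                int p = commonPrefix(prev, term);
                tis.add(new Entry(p, term.substring(p), freqOffsets[i] - prevFreq));
            }
            prev = term;
            prevFreq = freqOffsets[i];
        }
    }

    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Returns the freq offset for target, or -1 if the term is absent. */
    long seek(String target) {
        // 1. Find the nearest indexed term at or before the target.
        int lo = 0, hi = indexTerms.size() - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexTerms.get(mid).compareTo(target) <= 0) { slot = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        if (slot < 0) return -1;
        // 2. Jump there and read the absolutes.
        int pos = indexSlots.get(slot);
        int end = Math.min(pos + INDEX_INTERVAL, tis.size());
        Entry e = tis.get(pos);
        String term = e.suffix;
        long freq = e.freqValue;
        // 3. Scan forward within the block, applying deltas.
        while (term.compareTo(target) < 0 && pos + 1 < end) {
            e = tis.get(++pos);
            term = term.substring(0, e.sharedPrefix) + e.suffix;
            freq += e.freqValue;
        }
        return term.equals(target) ? freq : -1;
    }
}
```

Because the entry at each seek point is self-contained, the scan never needs state from before the jump, which is what lets the index store only the term and the offset.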
This is coded up in the FormatPostingsTermsDictWriter/Reader classes.
Mike
Jason Rutherglen wrote:
Michael,
Can you describe a bit more about why the term dictionary index is
no longer required?
Jason
On Tue, Nov 18, 2008 at 7:41 AM, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1458:
---------------------------------------
Attachment: LUCENE-1458.patch
Woops, sorry... I was missing a bunch of files. Try this one?
> Further steps towards flexible indexing
> ---------------------------------------
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo). At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas. Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers. EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
> * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------