Re: Adding clear() to Document

Shai Erera Wed, 20 May 2009 13:40:54 -0700

I'm actually not after any performance improvements, just convenience.
As I said, today I only have an awkward way to do what clear() would do;
instead I want to add a clear() method that simply calls fields.clear().
Since Document keeps an ArrayList of fields, calling clear() on it is not
that expensive (it nulls the array elements and sets the size to 0).
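
As a rough illustration of the difference, here is a minimal sketch using
hypothetical stand-in classes (SketchDocument, SketchField and ClearSketch are
invented for this example and are NOT Lucene's Document/Field API). It contrasts
the iterate-and-remove workaround with a clear() that delegates to the list:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field; NOT Lucene's Field class.
class SketchField {
    final String name;
    final String value;

    SketchField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical stand-in for a document backed by an ArrayList of fields.
class SketchDocument {
    private final List<SketchField> fields = new ArrayList<SketchField>();

    void add(SketchField field) { fields.add(field); }

    void remove(SketchField field) { fields.remove(field); }

    List<SketchField> getFields() { return fields; }

    // The proposed convenience method: delegate to the backing list.
    void clear() { fields.clear(); }
}

class ClearSketch {
    // Workaround without clear(): snapshot the fields, then remove each one.
    // Iterating over a copy avoids a ConcurrentModificationException.
    static void clearByRemoving(SketchDocument doc) {
        List<SketchField> snapshot = new ArrayList<SketchField>(doc.getFields());
        for (SketchField field : snapshot) {
            doc.remove(field);
        }
    }
}
```

Both paths leave the document empty; clear() just avoids the snapshot and the
per-field removal.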


Regarding re-adding the fields: if your documents *always* have the same
set of fields, i.e., every field appears in every document, then you don't
need clear(). You can simply create a Document, add all the Field instances
to it with empty values, and then have your code set their values.
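
A minimal sketch of that fixed-schema case, again with hypothetical stand-in
classes (FixedField, FixedDocument and FixedSchemaReuse are assumptions made for
illustration, not Lucene classes). The document and its fields are created once,
and only the values change per document:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a field with a mutable value.
class FixedField {
    final String name;
    private String value;

    FixedField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    void setValue(String value) { this.value = value; }

    String stringValue() { return value; }
}

// Hypothetical stand-in for a document holding a list of fields.
class FixedDocument {
    private final List<FixedField> fields = new ArrayList<FixedField>();

    void add(FixedField field) { fields.add(field); }

    List<FixedField> getFields() { return fields; }
}

class FixedSchemaReuse {
    // Fields are created once, with empty values, and added once.
    private final FixedField title = new FixedField("title", "");
    private final FixedField body = new FixedField("body", "");
    private final FixedDocument doc = new FixedDocument();

    FixedSchemaReuse() {
        doc.add(title);
        doc.add(body);
    }

    // Returns the same document instance every time, with updated values.
    // The caller must finish with it before asking for the next one.
    FixedDocument next(String titleText, String bodyText) {
        title.setValue(titleText);
        body.setValue(bodyText);
        return doc;
    }
}
```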

clear() is required when documents have *largely the same set of fields*, as
in Web documents for example. Every document will have the common title, id,
body, date and keywords fields, but each may also have fields according to its
meta tags. In that case you cannot simply keep them on the Document, since
the next one might not have all of them and may add new ones.

Therefore what I had in mind is to keep a Map<String, Field> which maps a
field name to its instance: clear the Document's fields, then add them back
one by one according to the fields I find in the page I'm parsing. I use the
map to obtain the Field instance for reuse (or create a new one if it doesn't
exist).
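
The map-based approach could be sketched roughly as follows. PageField,
PageDocument and ReusingDocBuilder are hypothetical names invented for this
example (not Lucene classes), and the parsed page is assumed to arrive as a
simple name-to-value map:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a field with a mutable value.
class PageField {
    final String name;
    private String value;

    PageField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    void setValue(String value) { this.value = value; }

    String stringValue() { return value; }
}

// Hypothetical stand-in for a document that offers clear().
class PageDocument {
    private final List<PageField> fields = new ArrayList<PageField>();

    void add(PageField field) { fields.add(field); }

    void clear() { fields.clear(); }

    List<PageField> getFields() { return fields; }
}

class ReusingDocBuilder {
    // Maps a field name to the Field instance reused across documents.
    private final Map<String, PageField> cache = new HashMap<String, PageField>();
    private final PageDocument doc = new PageDocument();

    // Clears the shared document, then re-adds only the fields found in this page.
    PageDocument build(Map<String, String> parsedPage) {
        doc.clear();
        for (Map.Entry<String, String> entry : parsedPage.entrySet()) {
            PageField field = cache.get(entry.getKey());
            if (field == null) {
                // First time this field name is seen: create and cache it.
                field = new PageField(entry.getKey(), entry.getValue());
                cache.put(entry.getKey(), field);
            } else {
                // Seen before: reuse the instance and overwrite its value.
                field.setValue(entry.getValue());
            }
            doc.add(field);
        }
        return doc;
    }

    int cachedFieldCount() { return cache.size(); }
}
```

Note that keying the cache by name holds only one instance per field name, so
this sketch does not cover documents that repeat the same field name.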

With that approach, I don't see the difference between adding the fields to a
List and passing it to Document, and calling clear() on the Document and adding
the fields one by one.

On Wed, May 20, 2009 at 11:25 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> Compared to caching and passing in a List to the Document constructor,
> I imagine a clear() based solution would be slower... there's more
> work to do.  clear() needs to null the pointers, and then one needs to
> add the fields again, one-by-one.  But I doubt we'd be able to detect
> a variance anyway, given that document construction time (as opposed
> to Field construction) is insignificant compared to indexing.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Wed, May 20, 2009 at 4:10 PM, Shai Erera <ser...@gmail.com> wrote:
> > I came across this while working on 1595 (changes to benchmark). I noticed
> > LineDocMaker reuses Document and Fields, and I wanted to pull that up to a
> > base DocMaker since I got the impression it yields better (even if not
> > significant) performance.
> >
> > With the addition of the Field ctor which accepts a boolean for interning,
> > and with the changes to String.intern() which are to come, I agree this
> > will have less impact, but it is still convenient. Today, I can already call
> > doc.getFields(), iterate on them and call doc.remove(Field).
> > Document.clear() will just save me the trouble.
> >
> > Besides all the above changes, reusing Document and Field saves object
> > allocations. For the documents in the benchmark package this may mean
> > millions of Document objects + many more Field objects. Even if it always
> > avoided interning, this means saving lots of allocations, which are really
> > not necessary.
> >
> > For other applications, the number of fields may be much larger than in the
> > current benchmark impls, where it becomes even more important.
> >
> > Passing a list of Fields will save the Field allocations (assuming the app
> > caches them on the outside) but still require Document allocation. Why not
> > save that too?
> >
> > On Wed, May 20, 2009 at 11:01 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> >>
> >> On Wed, May 20, 2009 at 3:27 PM, Shai Erera <ser...@gmail.com> wrote:
> >> > I noticed Document does not have a clear() method, to remove all the
> >> > Fields set on it.
> >>
> >> Document's state is so simple (a List and a boost), reuse doesn't seem
> >> worth it.
> >> What if, instead, we allowed the List to be passed in via Document's
> >> constructor?
> >>
> >> To put it into perspective, the Document object then becomes lighter
> >> weight than the String object (provided the user is caching the List
> >> of fields).  And really, I think caching the list of fields is even
> >> overboard for pretty much all of the applications out there - I doubt
> >> it would ever be significant given how much relative work is needed to
> >> index a document.
> >>
> >> -Yonik
> >> http://www.lucidimagination.com
