Yes, all postings for the entire doc are held in RAM data structures ... you could make your own indexing chain to somehow change this behavior, but I don't think that's an easy task.
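For reference, a rough sketch of the 4.x approach discussed below (an illustration only, not tested code; the "body" field name and the helper class are made up): addDocument() accepts any Iterable<? extends IndexableField>, and a TextField can wrap a Reader so the field value is never materialized as one giant String. The inverted postings for the whole document are still buffered in RAM until flush, though.

import java.io.Reader;
import java.util.Collections;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexableField;

class HugeFieldIndexer {
    // Sketch only: stream one huge tokenized field into IndexWriter without
    // first building a Document that holds the whole value in memory.
    static void indexHugeField(IndexWriter writer, Reader bodyReader) throws Exception {
        Iterable<? extends IndexableField> fields =
            Collections.singletonList(new TextField("body", bodyReader)); // tokenized, not stored
        writer.addDocument(fields);
    }
}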
Mike McCandless
http://blog.mikemccandless.com

On Thu, Feb 20, 2014 at 4:02 PM, Igor Shalyminov <ishalymi...@yandex-team.ru> wrote:
> Mike, thank you!
>
> So eventually this amount of data must stay entirely in RAM (as postings)
> before flushing to disk?
> Can it be hacked?)
>
> The documents themselves (that I will deliver to the user) are of a regular
> size, but the features that I generate grow combinatorially in size and blow
> up the index, in some sense.
> I definitely want to think about breaking them into pieces, thank you for
> the advice!
>
>
> --
> Best Regards,
> Igor Shalyminov
>
>
> 21.02.2014, 00:50, "Michael McCandless" <luc...@mikemccandless.com>:
>> Yes, in 4.x IndexWriter now takes an Iterable that enumerates the
>> fields one at a time.
>>
>> You can also pass a Reader to a Field.
>>
>> That said, there will still be massive RAM required by IW to hold the
>> inverted postings for that one document, likely much more RAM than the
>> original document's String contents.
>>
>> And, such huge documents are rarely useful in practice. E.g., how
>> will you "deliver" that hit to the end user at search time? Will
>> scores actually make sense for such enormous documents? It's better
>> to break them up into more manageable sizes.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Feb 20, 2014 at 3:22 PM, Igor Shalyminov
>> <ishalymi...@yandex-team.ru> wrote:
>>> Hello!
>>>
>>> I've faced a problem indexing huge documents. The indexing itself goes
>>> all right, but when document processing becomes concurrent,
>>> OutOfMemoryErrors start appearing (even with a heap of about 32GB).
>>> The issue, as I see it, is that I have to create a Document instance to
>>> send to IndexWriter, and a Document is just a collection of all the
>>> fields, all in RAM.
>>> With my huge fields, it would be much better to be able to send document
>>> fields for writing one by one, keeping no more than a single field in RAM.
>>> Is this possible in the latest Lucene?
>>>
>>> --
>>> Best Regards,
>>> Igor Shalyminov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org