Re: How to create document objects in our case

Michael McCandless Sun, 22 May 2011 06:36:22 -0700

You're welcome!

Mike


http://blog.mikemccandless.com

On Sun, May 22, 2011 at 9:20 AM, zhoucheng2008 <zhoucheng2...@gmail.com> wrote:
> Great, thanks Mike.
>
> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Sunday, May 22, 2011 8:09 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to create document objects in our case
>
> Norms are how Lucene records the a priori boost for each
> doc x field pair.  This boost is the product of the per-field boost, the
> per-doc boost (both of which your app sets when it creates the doc), and
> the "length normalization" that Lucene's default similarity applies
> (shorter fields get a higher boost).
>
> It's a quantized float, encoded as 8 bits.
>
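An illustrative aside on the norm described above: the sketch below multiplies the per-doc boost, the per-field boost and a length normalization, then squeezes the result into one byte. The names (NormSketch, rawNorm, encode, decode) are made up for this example. The 1/sqrt(numTerms) length normalization and the bit layout are assumptions modelled on Lucene's default similarity and its SmallFloat-style encoding, so check the Lucene source before relying on exact values.

```java
final class NormSketch {
    // Offset that maps the useful float exponent range onto a byte (assumed layout).
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Raw norm: doc boost * field boost * length normalization (assumed 1/sqrt(numTerms)). */
    static float rawNorm(float docBoost, float fieldBoost, int numTerms) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return docBoost * fieldBoost * lengthNorm;
    }

    /** Quantizes a float to 8 bits by keeping only its top exponent and mantissa bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        if (small <= ZERO_POINT) {
            // Zero or negative maps to 0; tiny positive values map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // too large: clamp to the largest code (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expands the 8-bit code back into a float; the dropped mantissa bits are lost. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += ZERO_POINT << 21;
        return Float.intBitsToFloat(bits);
    }
}
```

Under these assumptions a 10-term field with no boosts has a raw norm of about 0.316 but decodes to 0.3125, which gives a feel for how coarse an 8-bit norm is.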
> If your fields tend not to vary in length much, and you don't use
> boosting yourself, and you are worried about RAM, you should omit the
> norms.  (Field has a method to omit norms).
>
> Not sure when nested docs will be available but I hope soon... even
> so, once it's committed, it won't be released until the next release
> (though you can use it on the trunk/3.x tip as well, once it's
> committed).
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Sun, May 22, 2011 at 7:49 AM, zhoucheng2008 <zhoucheng2...@gmail.com>
> wrote:
>> Mike, thanks for the reply.
>>
>> Can you please elaborate a little more on "If you don't need norms
>> (don't boost, lengths don't vary much or you
>> don't care to have field length impact scoring) you can omit norms"?
>>
>> When do you expect support for nested documents to be available?
>>
>> Cheng
>>
>>
>> -----Original Message-----
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Sunday, May 22, 2011 6:58 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: How to create document objects in our case
>>
>> 30 fields is fine, but if they are all indexed you should watch out
>> for memory usage, i.e., norms require 1 byte per doc per indexed field.
>> If you don't need norms (don't boost, lengths don't vary much or you
>> don't care to have field length impact scoring) you can omit norms.
>>
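To put the "1 byte per doc per indexed field" figure in numbers, here is a back-of-the-envelope sketch. The corpus size (10,000 files, one document per Subgroup, 20 indexed fields that keep norms) is a hypothetical chosen for illustration and does not come from the thread; NormsMemory and normsBytes are made-up names.

```java
final class NormsMemory {
    /** Estimated norms footprint: 1 byte per document per indexed field that keeps norms. */
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 10,000 files, each contributing 2,000 Subgroup documents.
        long numDocs = 10_000L * 2_000L;
        long bytes = normsBytes(numDocs, 20);
        // 400,000,000 bytes, i.e. 381 MiB after integer division.
        System.out.println(bytes / (1024 * 1024) + " MiB of norms");
    }
}
```

Every field that omits norms drops out of the multiplication, which is why omitting them helps when RAM is the worry.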
>> The relationship between Group and Subgroup is not something Lucene can
>> express today -- all docs are "independent" in a Lucene index.
>>
>> So you can either "denormalize", meaning duplicate all Group fields
>> onto each Subgroup doc, or you can check out
>> https://issues.apache.org/jira/browse/LUCENE-2454 which I think adds
>> exactly what you need.  It's not yet committed, but we are finally
>> starting to make progress towards that, e.g.
>> https://issues.apache.org/jira/browse/LUCENE-3112
>>
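The "denormalize" option above could look roughly like the sketch below. It uses plain maps as stand-ins for Lucene documents so it runs without the Lucene jar; in a real indexer each map entry would become a field on a Document. The "group." and "subgroup." field-name prefixes and the Denormalizer class are hypothetical, added only so that attributes such as id and member, which exist on both elements, do not collide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Denormalizer {
    /** Builds one flat field map per Subgroup, copying every Group attribute onto it. */
    static List<Map<String, String>> denormalize(Map<String, String> group,
                                                 List<Map<String, String>> subgroups) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map<String, String> subgroup : subgroups) {
            Map<String, String> doc = new LinkedHashMap<>();
            // Duplicate the parent's attributes onto this child document.
            for (Map.Entry<String, String> e : group.entrySet()) {
                doc.put("group." + e.getKey(), e.getValue());
            }
            // Add the child's own attributes under a separate prefix.
            for (Map.Entry<String, String> e : subgroup.entrySet()) {
                doc.put("subgroup." + e.getKey(), e.getValue());
            }
            docs.add(doc);
        }
        return docs;
    }
}
```

The trade-off is index size: with about 30 Group attributes copied onto each of 2000 or so Subgroups, the Group data is stored thousands of times, but every document can then be searched and filtered on Group fields without any join.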
>> Mike
>>
>> http://blog.mikemccandless.com
>>
>> On Fri, May 20, 2011 at 8:27 PM, Cheng Zhou <zhoucheng2...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have a large number of XML files to be indexed by Lucene. All the files
>>> share similar structure as below:
>>>
>>> <Group id="abc" member="cde" blah blah ....>
>>>   <Subgroup id="abc1" member ="fgh" blah blah ...>
>>>   <Subgroup id="abc2" member ="fgh" blah blah ...>
>>>   <Subgroup id="abc3" member ="fgh" blah blah ...>
>>>   ......
>>> </Group>
>>>
>>> Things to be noted are:
>>>
>>> The root element of Group has 30 or so attributes, and it usually has over
>>> 2000 Subgroup elements, which in turn also have more than 20 attributes.
>>>
>>> I want to create one Document object which holds the contents of the Group
>>> element, and one Document object which holds all the Subgroup elements.
>>>
>>> Here are my challenges however:
>>>
>>> 1. How many fields are advised for a Document to be indexed by Lucene? Will
>>> over 30 fields (for the Group element) be too many?
>>>
>>> 2. How do I create a Document object and fields for holding all the Subgroup
>>> elements? Is this a good way to think about it?
>>>
>>> 3. How can I link the Document object of the Group element to the Document
>>> object of all the Subgroup elements?
>>>
>>> Please note that I intend to use these two Document objects to represent the
>>> group, but I don't know whether that is a good solution. I am open to
>>> using more than two Documents to do the job, but I don't know how to connect
>>> all the objects in Lucene.
>>>
>>> Many thanks!
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>

