For the record, this is made even more complex by the fact that the disk footprint of a document depends on other documents that are indexed nearby in the same segment, and can change over merges.
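Since a per-document figure is not well defined, one rough alternative is to measure the index directory before and after a commit and attribute the difference to the whole batch. Below is a minimal sketch, assuming the index lives in a plain, flat filesystem directory. The class name IndexDirSize and the method sizeInBytes are hypothetical helpers, not part of Lucene, and the code uses only java.nio. The delta can be zero or even negative if a merge happens to run in between, so treat it as an approximation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not a Lucene API): sums the sizes of the regular
// files found directly inside an index directory.
class IndexDirSize {

    static long sizeInBytes(Path indexDir) throws IOException {
        // Files.list is not recursive; subdirectories are ignored.
        try (Stream<Path> files = Files.list(indexDir)) {
            return files
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // A file may vanish mid-scan (for example, removed
                        // after a merge); count it as 0 bytes.
                        return 0L;
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        long before = sizeInBytes(indexDir);
        // ... add documents and commit with your own indexing code here ...
        long after = sizeInBytes(indexDir);
        System.out.println("Delta in bytes: " + (after - before));
    }
}
```

Measuring around a batch rather than a single document smooths out some of the per-segment effects, but the number still reflects shared structures and any merges, not an exact per-document cost.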

On Thu, Jul 5, 2018 at 08:22, Chris Bamford <ch...@bammers.net> wrote:

> Yes I see, I originally missed Terry’s response which is probably the
> source of the confusion.
>
> So to clarify: I already know the size of the source document. As you say,
> this bears little resemblance to what actually gets written when indexed.
> It is this latter figure I was hoping to get.
>
> Thanks everyone.
>
> Chris
>
> > On 5 Jul 2018, at 03:31, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > I think we're not talking about the same thing.
> >
> > You asked "How can I calculate the total size of a Lucene Document"...
> >
> > I was responding to Terry's comment "In the document types I
> > usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
> > called "stream_size" that contains the size of the document on disk. "
> >
> > Two totally different beasts. One is the source document, the other is
> > what you choose to put into the index from that document. Not to even
> > mention that you could, for instance, choose to index only the title
> > and throw everything else away so the size of the raw document on disk
> > doesn't seem useful for your case.
> >
> > Best,
> > Erick
> >
> >> On Wed, Jul 4, 2018 at 9:24 AM, Chris Bamford <ch...@bammers.net> wrote:
> >> Hi Erick
> >>
> >> Yes, size on disk is what I’m after as it will feed into an eventual
> >> calculation regarding actual bytes written (not interested in the source
> >> data document size, just real disk usage).
> >> Thanks
> >>
> >> Chris
> >>
> >>> On 4 Jul 2018, at 17:08, Erick Erickson <erickerick...@gmail.com> wrote:
> >>>
> >>> But does size on disk help? If the doc has a zillion
> >>> images in it, those aren't part of the resulting index
> >>> (I'm excluding stored data here)....
> >>>
> >>>> On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen <te...@net-frame.com> wrote:
> >>>> In the document types I usually index (.pdf, .docx/.doc, .eml), there
> >>>> exists a metadata field called "stream_size" that contains the size of
> >>>> the document on disk. You don't have to compute it. Thus, when you
> >>>> retrieve each document you can pull out the contents of this field and,
> >>>> if you like, include it in each hitlist entry.
> >>>>
> >>>>> On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
> >>>>> Hi there,
> >>>>>
> >>>>> How can I calculate the total size of a Lucene Document that I'm about
> >>>>> to write to an index so I know how many bytes I am writing please? I
> >>>>> need it for some external metrics collection.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> - Chris
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org