The exception is rethrown only if the write limit has not been reached. So if an exception occurs within the first 100k characters, it propagates and affects the result. If it is thrown after that point, it is suppressed.
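To make the control flow concrete, here is a minimal JDK-only sketch of the same catch-and-check pattern. `LimitedSink`, `LimitDemo` and `extract` are hypothetical stand-ins invented for illustration; they are not Tika classes and do not claim to mirror how WriteOutContentHandler is implemented.

```java
import java.io.IOException;

// Hypothetical stand-in for a size-limited output handler (NOT a Tika class).
final class LimitedSink {
    // Private marker type, so the sink can recognise its own "limit reached" signal.
    private static final class LimitReached extends IOException {
        LimitReached() { super("write limit reached"); }
    }

    private final StringBuilder buffer = new StringBuilder();
    private final int limit;

    LimitedSink(int limit) { this.limit = limit; }

    // Appends text; keeps only what fits and signals once the limit is exceeded.
    void write(String text) throws IOException {
        int room = limit - buffer.length();
        if (text.length() > room) {
            buffer.append(text, 0, room);
            throw new LimitReached();
        }
        buffer.append(text);
    }

    boolean isWriteLimitReached(Throwable t) {
        return t instanceof LimitReached;
    }

    @Override
    public String toString() { return buffer.toString(); }
}

final class LimitDemo {
    // Feeds chunks into the sink; failAt simulates a parse error at that chunk index
    // (use -1 for no simulated error).
    static String extract(LimitedSink out, String[] chunks, int failAt) throws IOException {
        try {
            for (int i = 0; i < chunks.length; i++) {
                if (i == failAt) {
                    throw new IOException("parse failure in chunk " + i);
                }
                out.write(chunks[i]);
            }
        } catch (IOException e) {
            // Same shape as the quoted snippet: only the limit signal is swallowed.
            if (!out.isWriteLimitReached(e)) {
                throw e;
            }
        }
        return out.toString();
    }
}
```

In this sketch a failure before the limit propagates to the caller, while hitting the limit stops processing and returns the truncated text. A failure placed after the limit is never reached at all, which from the caller's side looks the same as it being suppressed.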
--
Best regards, Konstantin Gribov.

On 28.03.2014 13:32, "Stefano Fornari" <[email protected]> wrote:

> Hi Jukka,
> thanks a lot for your reply.
>
> On #1 I am still wondering why for indexing we need structure information.
> Is there any particular reason? Wouldn't it make more sense to get just the
> text by default and only optionally get the structure?
>
> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason for throwing the
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background.)
>
> Ste
>
> On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting <[email protected]> wrote:
>
> > Hi,
> >
> > On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
> > <[email protected]> wrote:
> > > 1. is the use of PDF2XHTML necessary? why is the pdf turned into an
> > > XHTML? for the purpose of indexing, wouldn't just the text be enough?
> >
> > The XHTML output allows us to annotate the extracted text with
> > structural information (like "this is a heading", "here's a
> > hyperlink", etc.) that would be difficult to express with text-only
> > output. A client that needs just the text content can get it easily
> > with the BodyContentHandler class.
> >
> > > 2. I need to limit the index of the content to files whose size is
> > > below a certain threshold; I was wondering if this could be a parser
> > > configuration option and thus if you would accept this change.
> >
> > Do you want to entirely exclude too large files, or just index the
> > first few pages of such files (which is more common in many indexing
> > use cases)?
> >
> > The latter use case can be implemented with the writeLimit parameter of
> > the WriteOutContentHandler class, like this:
> >
> >     // Extract up to 100k characters from a given document
> >     WriteOutContentHandler out = new WriteOutContentHandler(100_000);
> >     try {
> >         parser.parse(..., new BodyContentHandler(out), ...);
> >     } catch (SAXException e) {
> >         if (!out.isWriteLimitReached(e)) {
> >             throw e;
> >         }
> >     }
> >     String content = out.toString();
> >
> > BR,
> >
> > Jukka Zitting
