Hi Ken,

On Thu, Aug 16, 2012 at 1:28 AM, Ken Krugler <[email protected]> wrote:
> For many Tika parsers, the text you get back from the document starts with
> the title (if any), and then contains the body.

For clarity, the document we are testing can be found here [0].
The title field contains the text "test rtf document", the subject field contains "subject tests", and the text field contains "The quick brown fox…". However, I'm not sure whether it's the structure of the document that is throwing this one off. There is no doubt about it: calling parse.getText() definitely returns the text contained within the title field.

>
> So I'm wondering if what you're seeing in the test failure is that the
> parse.getText() result is actually "test rtf document\nThe quick brown fox…"
>
> -- Ken
>
> On Aug 15, 2012, at 12:49pm, Lewis John Mcgibbney wrote:

[0] https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/parse-tika/sample/test.rtf

