Re: Parsing order issue

Tim Allison Tue, 17 Dec 2019 16:12:39 -0800

Tilman,
   That isn’t correct. I’ll find the link that might help...
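
For reference, here is a minimal sketch of how the PDF part of such a config is usually written. It assumes Tika 1.x and the PDF parser class org.apache.tika.parser.pdf.PDFParser, which the quoted config does not name. As far as I know, DefaultParser does not accept a sortByPosition property, so the sketch passes the option to PDFParser as a param instead. It is untested against the file from this thread:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser for everything except PDF (assumption: PDF is
         excluded here so the dedicated entry below handles it) -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>

    <!-- Dedicated PDF parser with position-based text sorting enabled
         (assumption: Tika 1.x params syntax) -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Whether this changes the text order for the PDF in question cannot be confirmed without the file itself.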

On Tue, Dec 17, 2019 at 1:02 PM Tilman Hausherr <[email protected]>
wrote:


> I already answered... we need the PDF.
>
> But... about the config:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>    <parsers>
>      <!-- Default Parser for most things, except for 2 mime types, and never
>           use the Executable Parser -->
>      <parser class="org.apache.tika.parser.DefaultParser">
>        <mime-exclude>image/jpeg</mime-exclude>
>        <mime-exclude>application/pdf</mime-exclude>
>        <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
>      </parser>
>
>      <!-- Use a different parser for PDF -->
>      <parser class="org.apache.tika.parser.DefaultParser">
>      <property name="sortByPosition" value="true"/>
>        <mime>application/pdf</mime>
>      </parser>
>    </parsers>
> </properties>
>
> Is this a correct setting for PDFs in Tika? I notice that the same
> parser class is used twice.
>
> And the file was named "tika.config", shouldn't it be named
> "tika-config.xml"?
>
> Tilman
>
> Am 17.12.2019 um 13:33 schrieb Tim Allison:
> > PDFBox Colleagues,
> >    Any recommendations?
> >
> > On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <[email protected]> wrote:
> >
> >> Dear Tika Dev Team,
> >>
> >>
> >>
> >> Hope this email finds you well.
> >>
> >>
> >>
> >> I have been actively using Tika for reading PDF files. One issue I found is
> >> the parsing order. As shown in the attached image, the parsing order of the
> >> PDF file is not based on the position of the text.
> >>
> >>
> >>
> >> As suggested in this GitHub link
> >> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
> >> customized config file (see attached), hoping to solve the issue. But this
> >> has not worked. If possible, could you please review this issue and
> >> provide any insights or solutions?
> >>
> >>
> >>
> >> Thanks so much in advance.
> >>
> >>
> >>
> >> Regards,
> >>
> >> Luke
> >>
>
>
