Tilman,
  That isn't correct. I'll find the link that might help...

On Tue, Dec 17, 2019 at 1:02 PM Tilman Hausherr <[email protected]> wrote:
> I already answered... we need the PDF.
>
> But... about the config:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <!-- Default Parser for most things, except for 2 mime types, and never
>          use the Executable Parser -->
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>image/jpeg</mime-exclude>
>       <mime-exclude>application/pdf</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
>     </parser>
>
>     <!-- Use a different parser for PDF -->
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <property name="sortByPosition" value="true"/>
>       <mime>application/pdf</mime>
>     </parser>
>   </parsers>
> </properties>
>
> Is this a correct setting for PDFs in tika? I notice that the same
> parser class is used twice.
>
> And the file was named "tika.config", shouldn't it be named
> "tika-config.xml"?
>
> Tilman
>
> On 17.12.2019 at 13:33, Tim Allison wrote:
> > PDFBox Colleagues,
> >   Any recommendations?
> >
> > On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <[email protected]> wrote:
> >
> >> Dear Tika Dev Team,
> >>
> >> Hope this email finds you well.
> >>
> >> I have been actively using Tika for pdf file reading. One issue I found is
> >> the parsing order. As shown in attached image, the parsing order of pdf
> >> file is not based on position of texts.
> >>
> >> As suggested in this github link
> >> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
> >> customized config file (see attached), hoping to solve the issue. But this
> >> has not worked out. If any chance, can you please review this issue, and
> >> provide any insights or solutions?
> >>
> >> Thanks so much in advance.
> >>
> >> Regards,
> >>
> >> Luke
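
For reference, here is a minimal sketch of what a PDF-specific tika-config.xml might look like. It is not the file attached to the original message. It assumes Tika 1.x, where the PDF parser class is org.apache.tika.parser.pdf.PDFParser and parser settings are passed in a <params> block; check the exact syntax against the Tika version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser for everything except PDF (assumption: PDFParser
         is excluded here so that the entry below handles PDFs) -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>

    <!-- PDF parser with position-based text sorting turned on -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Unlike the quoted config, this names the PDF parser class directly instead of declaring DefaultParser a second time. Whether sorting by position actually produces the expected reading order still depends on the PDF itself.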
