Piler should drop the use of all those outdated libraries and use Apache Tika (https://tika.apache.org/) instead.

> On 08 May 2019, at 10:52, Katterl Christian <[email protected]> wrote:
>
> In at least my case, this does not seem to work.
>
> BR, Christian
>
> From: Janos SUTO <[email protected]>
> Sent: Monday, 6 May 2019 11:33
> To: Piler User <[email protected]>
> Subject: Re: Indexation of Excel files newer than 2007
>
> Newer office files, e.g. xlsx, etc. should be handled internally by the
> parser, provided that you have the libzip package installed as well as the
> header files, libzip-dev or similar.
>
> Janos
>
> From: Katterl Christian
> Sent: Mon May 06 10:19:07 GMT+02:00 2019
> To: Piler User
> Subject: Re: Indexation of Excel files newer than 2007
>
> Hello again,
>
> for docx, there would be: https://github.com/ankushshah89/python-docx2txt
>
> Unfortunately, I am not a software developer, so I cannot make the
> adaptations myself.
>
> BR Christian
>
> From: Martin Nadvornik <[email protected]>
> Sent: Monday, 6 May 2019 09:46
> To: Piler User <[email protected]>
> Subject: Re: Indexation of Excel files newer than 2007
>
> Hello Christian,
>
> catdoc is not capable of processing the new office formats. As far as I know
> there is no intention for catdoc to implement this in the foreseeable
> future. The same problem exists for xls2csv. You could theoretically try to
> call unoconv (https://github.com/unoconv/unoconv) before catdoc, but it will
> probably have a big performance impact since it launches LibreOffice /
> OpenOffice for the conversion. But if you try this I would be interested in
> your results, since being limited to indexing only old office formats is
> also something we would like to overcome. Alternatively, if you can find
> open source software which is capable of efficiently extracting plain text
> from current office formats, it should be easy to integrate into piler
> (basically a few lines in extract.c as far as I can tell). For Excel there
> are https://github.com/xevo/xls2csv and https://github.com/nagirrab/xls2csv,
> which claim to be capable of processing xlsx files. But I haven't looked
> into them yet.
>
> Kind Regards
> Martin
>
> On 06.05.2019 at 06:45, Katterl Christian wrote:
> Hello,
>
> Indexation of Excel files newer than Excel 2007 fails in my installation.
> I am using catdoc 0.95 and it reports:
>
> This file looks like ZIP archive or Office 2007 or later file.
> Not supported by catdoc
>
> The Excel file was created using Excel 2010.
>
> BR, Christian
>
> Christian Katterl
> Teamleader Technical IT
> Asamer Baustoffe AG
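
For anyone who wants to check what text is actually inside an xlsx attachment, here is a minimal sketch in Python. It relies on the fact that an Office 2007+ file is a ZIP container (which is why catdoc rejects it) and that xlsx keeps its string cells in xl/sharedStrings.xml. The function name xlsx_shared_strings is made up for illustration; this is not piler's code and says nothing about how extract.c does it. It only reads the shared strings, so numbers and inline strings stored in the sheet XML are skipped.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by xl/sharedStrings.xml
SS_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_shared_strings(data: bytes) -> list:
    """Return the text of every shared string in an .xlsx file given as bytes.

    Hypothetical helper for illustration: only xl/sharedStrings.xml is read.
    """
    if not zipfile.is_zipfile(io.BytesIO(data)):
        raise ValueError("not a ZIP container, so not an Office 2007+ file")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "xl/sharedStrings.xml" not in zf.namelist():
            return []  # workbook without a shared string table
        root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    strings = []
    for si in root.iter(SS_NS + "si"):
        # Join all <t> descendants: one plain <t>, or several rich-text runs
        strings.append("".join(t.text or "" for t in si.iter(SS_NS + "t")))
    return strings
```

Running it on the bytes of a real attachment and joining the result with newlines should give roughly the text an indexer would want to see for string cells. If that list comes back empty for a file that clearly contains text, the strings are probably stored inline in the sheets, which this sketch does not cover.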
