Re: AW: Indexation of Excel files newer than 2007

[email protected] Wed, 08 May 2019 02:05:50 -0700

Piler should drop the use of all those outdated libraries and use 
https://tika.apache.org/
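
To make the suggestion concrete, here is a minimal sketch of asking a Tika
server for the plain text of a document. It assumes a tika-server instance is
already running on its default port 9998; the function names are hypothetical
and this is not how piler works today.

```python
import urllib.request


def build_tika_request(document: bytes, base_url: str = "http://localhost:9998"):
    """Build the HTTP request that asks tika-server for plain text.

    Assumption: a tika-server instance is reachable at `base_url`.
    """
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=document,
        method="PUT",
        # Ask for plain text instead of the default XHTML output.
        headers={"Accept": "text/plain"},
    )


def extract_text(document: bytes, base_url: str = "http://localhost:9998") -> str:
    """Send the document to the server and return the extracted text."""
    with urllib.request.urlopen(build_tika_request(document, base_url)) as response:
        return response.read().decode("utf-8")
```

Tika detects the file type itself, so the same call would cover xls, xlsx,
doc and docx attachments alike.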


> On 08 May 2019, at 10:52, Katterl Christian <[email protected]> wrote:
> 
> At least in my case, this does not seem to work.
>  
> BR, Christian
>  
>  
>  
> From: Janos SUTO <[email protected]> 
> Sent: Monday, 6 May 2019 11:33
> To: Piler User <[email protected]>
> Subject: Re: Indexation of Excel files newer than 2007
>  
> Newer Office files, e.g. xlsx, should be handled internally by the parser, 
> provided that you have the libzip package installed along with its header 
> files (libzip-dev or similar).
> 
> Janos
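
The libzip dependency makes sense because an xlsx file is a ZIP container
holding XML parts, with most cell text stored in xl/sharedStrings.xml. The
following sketch shows that idea using only the Python standard library. It
is an illustration, not piler's actual parser: the function name is
hypothetical, and numbers and inline strings are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts inside an .xlsx container.
SSML_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_shared_strings(source):
    """Return the strings stored in xl/sharedStrings.xml of an .xlsx file.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        try:
            raw = archive.read("xl/sharedStrings.xml")
        except KeyError:
            # Workbooks without any string cells may omit this part.
            return []
    root = ET.fromstring(raw)
    strings = []
    for item in root.iter(SSML_NS + "si"):
        # One <si> entry may be split into several <t> text runs.
        strings.append("".join(t.text or "" for t in item.iter(SSML_NS + "t")))
    return strings
```

Calling xlsx_shared_strings("report.xlsx") would return a list of strings
that could be joined and handed to the indexer.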
> From: Katterl Christian 
> Sent: Mon May 06 10:19:07 GMT+02:00 2019
> To: Piler User 
> Subject: AW: Indexation of Excel files newer than 2007
> 
>  
> Hello again,
>  
> 
> for docx, there would be: https://github.com/ankushshah89/python-docx2txt
>  
> 
> Unfortunately, I am not a software developer, so I cannot make the adaptations myself.
>  
> 
> BR Christian
>  
> 
>  
> 
> From: Martin Nadvornik <[email protected]> 
> Sent: Monday, 6 May 2019 09:46
> To: Piler User <[email protected]>
> Subject: Re: Indexation of Excel files newer than 2007
>  
> 
> Hello Christian,
> 
> catdoc is not capable of processing the new Office formats. As far as I know, 
> there is no intention for catdoc to implement this in the foreseeable future. 
> The same problem exists for xls2csv. You could theoretically try to call 
> unoconv (https://github.com/unoconv/unoconv) before catdoc, but it will 
> probably have a big performance impact since it launches LibreOffice / 
> OpenOffice for the conversion. If you try this, I would be interested in 
> your results, since being limited to indexing only the old Office formats is 
> something we would also like to overcome. Alternatively, if you can find 
> open source software that is capable of efficiently extracting plain text 
> from current Office formats, it should be easy to integrate into piler 
> (basically a few lines in extract.c, as far as I can tell). For Excel there 
> are https://github.com/xevo/xls2csv and https://github.com/nagirrab/xls2csv, 
> which claim to be capable of processing xlsx files, but I haven't looked 
> into them yet.
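
Sending old files to catdoc and newer ones to a different extractor depends
on telling the two container formats apart. A small sketch of that check
follows; the function names and return labels are hypothetical and not taken
from extract.c. Legacy xls/doc files are OLE2 compound documents, while
xlsx/docx files are ZIP archives, and each starts with a fixed signature.

```python
# First bytes of an OLE2 compound document (legacy .xls, .doc, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a ZIP archive (.xlsx, .docx).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_office_container(header: bytes) -> str:
    """Classify a file by its leading bytes: 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first eight bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_office_container(handle.read(8))
```

A caller could pass "ole2" files to catdoc or xls2csv as before and send
"zip" files to whichever newer extractor is chosen. A ZIP signature alone
does not prove that the file is an Office document.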
> 
> Kind Regards
> Martin
> 
> On 06.05.2019 at 06:45, Katterl Christian wrote:
> Hello,
>  
> Indexing of Excel files newer than Excel 2007 fails in my installation.
> I am using catdoc 0.95 and it reports:
>  
> This file looks like ZIP archive or Office 2007 or later file.
> Not supported by catdoc
>  
> The Excel file was created using Excel 2010.
>  
> BR, Christian
> 
> 
> Christian Katterl
> Teamleader Technical IT 
> 
> 
> 
> Asamer Baustoffe AG
> Unterthalham Straße 2
> 4694 Ohlsdorf
> Austria
> tel  +43 50 799 - 2511
> mobile +43 664 811 54 99
> email [email protected]
> www.abag.at
> 
> This message is confidential. It may not be disclosed to, or used by, anyone 
> other than the addressee. If you receive this message by mistake, please 
> advise the sender.
> Commercial register: Landesgericht Wels, FN: 407726y, ATU 68646334
> 
>
