[ 
https://issues.apache.org/jira/browse/TIKA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603946#comment-14603946
 ] 

Tim Allison edited comment on TIKA-1601 at 6/27/15 6:44 PM:
------------------------------------------------------------

Not anywhere near committing, but this is a rough start.

Some TODOs:
* -Figure out how to get non-ASCII text out correctly-
* Figure out how to grab attachments from the accdb file
* Figure out if there's a flag for HTML-marked-up text cells so that we can 
strip the markup [0]
* Figure out if there's a way to prevent Jackcess from trying to open linked 
files [0]
* Add unit tests :)

I used [~centic]'s code [1] to pull ~3k mdb files from CommonCrawl for testing.

[0]: https://sourceforge.net/p/jackcess/discussion/456474/thread/038878e6/
[1]: https://github.com/centic9/CommonCrawlDocumentDownload



was (Author: [email protected]):
Not anywhere near committing, but this is a rough start.

Some TODOs:
* Figure out how to get non-ASCII text out correctly
* Figure out how to grab attachments from the accdb file
* Add unit tests :)


> Integrate Jackcess to handle MSAccess files
> -------------------------------------------
>
>                 Key: TIKA-1601
>                 URL: https://issues.apache.org/jira/browse/TIKA-1601
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>         Attachments: jackcess_nocommit_v1.patch, testAccess2.zip
>
>
> Recently, James Ahlborn, the current maintainer of 
> [Jackcess|http://jackcess.sourceforge.net/], kindly agreed to relicense 
> Jackcess to Apache 2.0.  [~boneill], the CTO at [Health Market Science, a 
> LexisNexis® Company|https://www.healthmarketscience.com/], also agreed with 
> this relicensing and led the charge to obtain all necessary corporate 
> approval to deliver a 
> [CCLA|https://www.apache.org/licenses/cla-corporate.txt] for Jackcess to 
> Apache.  As anyone who has tried to get corporate approval for anything 
> knows, this can require no small amount of effort.
> If I may speak on behalf of Tika and the larger Apache community, I offer a 
> sincere thanks to James, Brian and the other developers and contributors to 
> Jackcess!!!
> Once the licensing info has been changed in Jackcess and the new release is 
> available in Maven, we can integrate Jackcess into Tika and add a capability 
> to process MSAccess files.
> As a side note, I reached out to the developers and contributors to determine 
> if there were any objections.  I couldn't find addresses for everyone, and 
> not everyone replied, but those who did offered their support to this move. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
