[jira] [Commented] (TIKA-1315) Basic list support in WordExtractor

Tim Allison (JIRA) Mon, 04 May 2015 18:47:38 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527751#comment-14527751
 ]


Tim Allison commented on TIKA-1315:
-----------------------------------

I've been trying to create a ListManager class that can be used for both doc 
and docx.  I found that we need to add a few classes at the POI level to get 
the number format string for docx (e.g. "%1.") into the ooxml-lite jar.  In 
[POI-57889 | https://bz.apache.org/bugzilla/show_bug.cgi?id=57889], I added 
code to XWPFParagraph to expose that string and the start-value overrides.

I initially thought that the number format string wasn't that important, but 
it really is, especially if the numbering runs along these lines:
{noformat}
1
1.1
1.1.1
{noformat}
So, we'll have to wait for the release of the next version of POI before we can 
close out this ticket.  That said, I can and will continue to prep the code at 
the Tika level so that we're ready to go.  The next version of POI is due out 
in the next week or so.
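
To illustrate why the format string matters, here is a minimal sketch of how a 
docx-style level text such as "%1." or "%1.%2.%3" could be expanded from 
per-level counters.  The class and method names (NumberFormatSketch, render) 
are hypothetical and are not the ListManager or any POI API; the sketch also 
assumes plain decimal numbering at every level.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not Tika's ListManager or a POI class.
class NumberFormatSketch {

    // A placeholder is '%' followed by a 1-based level index, e.g. %1, %2.
    private static final Pattern PLACEHOLDER = Pattern.compile("%([1-9])");

    /**
     * Expands a level text such as "%1.%2." using the current counter of each level.
     * Assumption: every level is rendered as a plain decimal number.
     *
     * @param lvlText  format string, e.g. "%1." or "%1.%2.%3"
     * @param counters current counter per level; counters[0] belongs to level 1
     */
    static String render(String lvlText, int[] counters) {
        Matcher m = PLACEHOLDER.matcher(lvlText);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int level = Integer.parseInt(m.group(1));
            // A placeholder for a level with no counter yet is replaced by nothing.
            String value = level <= counters.length
                    ? Integer.toString(counters[level - 1])
                    : "";
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("%1.", new int[] {3}));
        System.out.println(render("%1.%2.%3", new int[] {1, 1, 1}));
    }
}
```

Without the format string, an extractor knows only the counters (1, 1, 1) and 
cannot tell whether the author's list item reads "1.1.1", "1)" or "1.".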


> Basic list support in WordExtractor
> -----------------------------------
>
>                 Key: TIKA-1315
>                 URL: https://issues.apache.org/jira/browse/TIKA-1315
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Filip Bednárik
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: ListManager.tar.bz2, ListNumbering.patch, 
> ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post an issue like this, but I have no other 
> way of contacting you and I don't quite understand how you manage forks and 
> pull requests (I don't think you do that). Plus, I don't know your coding 
> style.
> In my project I needed Tika to parse numbered lists from Word .doc 
> documents, but Tika doesn't support that. So I looked for a solution and 
> found one here: 
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
> I adapted this solution to Apache Tika with a few fixes and improvements. 
> Feel free to use any of it so it can help people who struggle with 
> lists in Tika like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
