Hi Oleg,

UIMA could be useful for extracting text from XML (I'm not familiar enough with it...), but I think we should still fix Tika's own XML extraction.
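
To make the problem concrete, here is a minimal sketch of what goes wrong with compact XML and one possible way to pad the output. It uses only the JDK's SAX classes; it is not Tika's XMLParser code and not the attached patch. The class name CompactXmlText and the padElements flag are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: plain JDK SAX, not Tika's XMLParser.
class CompactXmlText {

    // Collects character data from the XML; when padElements is true,
    // a single space is appended every time an element closes.
    public static String extract(String xml, boolean padElements) throws Exception {
        StringBuilder out = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (padElements) {
                    out.append(' ');
                }
            }
        });
        // Trim the padding left behind by the closing elements at the end.
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Hello</title><body>world</body></doc>";
        System.out.println(extract(xml, false)); // prints "Helloworld"
        System.out.println(extract(xml, true));  // prints "Hello world"
    }
}
```

Padding on every end tag is the simplest option, but it would also split text around inline elements (e.g. `<b>foo</b>bar` becomes "foo bar"), so the real fix may need to be more selective.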

Mike McCandless

http://blog.mikemccandless.com

On Thu, Dec 20, 2012 at 6:14 AM, Oleg Tikhonov <[email protected]> wrote:
> Hi Make,
>
> May be consider using of UIMA ("the rule engine") ?
>
> BR,
> Oleg
>
>
> On Thu, Dec 20, 2012 at 1:05 PM, Michael McCandless (JIRA)
> <[email protected]> wrote:
>
>>
>> [ https://issues.apache.org/jira/browse/TIKA-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Michael McCandless updated TIKA-1048:
>> -------------------------------------
>>
>>     Attachment: TIKA-1048.patch
>>
>> Patch w/ failing test ... I'm not sure where/how to best fix this yet ...
>>
>> > XMLParser should add whitespace between elements
>> > ------------------------------------------------
>> >
>> >                 Key: TIKA-1048
>> >                 URL: https://issues.apache.org/jira/browse/TIKA-1048
>> >             Project: Tika
>> >          Issue Type: Bug
>> >          Components: parser
>> >            Reporter: Michael McCandless
>> >             Fix For: 1.3
>> >
>> >         Attachments: TIKA-1048.patch
>> >
>> >
>> > If the incoming XML is compact (ie doesn't have whitespace between
>> > elements), I think we should somehow add whitespace between elements when
>> > extracting text?
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
