[ https://issues.apache.org/jira/browse/TIKA-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322715#comment-15322715 ]
George L. Yermulnik commented on TIKA-2001:
-------------------------------------------
> By default Tika only extracts the text between XML tags, not things like
> attribute values. Since all the content in this XML file is in the
> attributes, nothing gets extracted.
Oh! I see. I'm new to Tika and hadn't known that.
> What kind of output would make sense in this case?
In my case the second variant would be preferable, though I'm not sure
that's what Tika is intended to handle.
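
To make the distinction in the quoted explanation concrete, here is a minimal
sketch in Python using only the standard library. It is not Tika's code and
does not claim to reproduce Tika's parser. It only shows that, for a shortened
copy of the file from the report, the text between tags is empty while the
attributes carry all the data. The function names and the "tag@name=value"
output format are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the report: every value lives in an attribute.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<spocosy version="1.0">'
    b'<subscription-update subscriptionid="0" requestid="0" exec="0.002">'
    b'<lineup id="0" shirt_number="0" del="no"/>'
    b'</subscription-update></spocosy>'
)


def element_text(xml_bytes):
    """Return only the character data found between tags."""
    root = ET.fromstring(xml_bytes)
    return "".join(root.itertext()).strip()


def attribute_lines(xml_bytes):
    """Return one 'tag@name=value' line per attribute (hypothetical format)."""
    root = ET.fromstring(xml_bytes)
    return [
        f"{elem.tag}@{name}={value}"
        for elem in root.iter()
        for name, value in elem.attrib.items()
    ]


if __name__ == "__main__":
    print(repr(element_text(SAMPLE)))   # text between tags only
    for line in attribute_lines(SAMPLE):
        print(line)                     # attribute values, one per line
```

With this input, element_text() returns an empty string because no element
contains character data, which matches the empty output seen in the report.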
> Parsing XML outputs empty string
> --------------------------------
>
> Key: TIKA-2001
> URL: https://issues.apache.org/jira/browse/TIKA-2001
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11, 1.12, 1.13
> Reporter: George L. Yermulnik
> Priority: Minor
>
> Can't get Tika to parse my XML files:
> {code}
> root@spring:/tmp# java -version
> java version "1.8.0_91"
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
> root@spring:/tmp# cat /tmp/xml/5751061032fbd-7148.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <spocosy version="1.0"><subscription-update subscriptionid="0" requestid="0"
> last_push="2016-06-03 06:21:34" current_push="2016-06-03 06:21:37"
> exec="0.002"><lineup id="0" event_participantsFK="0" participantFK="0"
> lineup_typeFK="0" shirt_number="0" pos="0" enet_pos="0" n="0" ut="2016-06-03
> 06:21:37" del="no"/></subscription-update></spocosy>
> root@spring:/tmp# for i in 3 2 1; do
> echo -n "tika-app-1.1${i}.jar: "
> java -jar tika-app-1.1${i}.jar --text /tmp/xml/5751061032fbd-7148.xml
> done
> tika-app-1.13.jar:
> tika-app-1.12.jar:
> tika-app-1.11.jar:
> root@spring:/tmp#
> {code}
> I'd appreciate any help. Thanks.