[ https://issues.apache.org/jira/browse/TIKA-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322710#comment-15322710 ]
Jukka Zitting commented on TIKA-2001:
-------------------------------------
By default, Tika extracts only the text between XML tags, not other content
such as attribute values. Since all the content in this XML file is in
attributes, the extracted text is empty.
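To illustrate the distinction, here is a minimal sketch using only the JDK's SAX parser. It is not Tika's own handler, and the class and method names are made up for the example. It collects character data and attribute values separately, so for a document like the one in this issue the first bucket stays empty while the second holds all the data.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: plain JDK SAX, not part of Tika.
class TextVersusAttributes {

    /** Returns { character data, attribute values }, each trimmed. */
    static String[] collect(String xml) {
        StringBuilder text = new StringBuilder();
        StringBuilder attrs = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Text between tags: the only part a text-only handler keeps.
                text.append(ch, start, length);
            }

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes a) {
                // Attribute values: ignored by a text-only handler.
                for (int i = 0; i < a.getLength(); i++) {
                    attrs.append(a.getValue(i)).append(' ');
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return new String[] { text.toString().trim(), attrs.toString().trim() };
    }

    public static void main(String[] args) {
        // Shortened version of the reporter's file.
        String xml = "<spocosy version=\"1.0\"><subscription-update exec=\"0.002\">"
                + "<lineup del=\"no\"/></subscription-update></spocosy>";
        String[] parts = collect(xml);
        System.out.println("text:       [" + parts[0] + "]");
        System.out.println("attributes: [" + parts[1] + "]");
    }
}
```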
What kind of output would make sense in this case?
Perhaps something like this:
{noformat}
0 0 2016-06-03 06:21:34 2016-06-03 06:21:37 0.002
0 0 0 0 0 0 0 0 2016-06-03 06:21:37 no
{noformat}
or like this:
{noformat}
spocosy
subscription-update subscriptionid 0 requestid 0 last_push 2016-06-03 06:21:34 current_push 2016-06-03 06:21:37 exec 0.002
lineup id 0 event_participantsFK 0 participantFK 0 lineup_typeFK 0 shirt_number 0 pos 0 enet_pos 0 n 0 ut 2016-06-03 06:21:37 del no
{noformat}
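For discussion, both shapes could be produced by something like the sketch below. It uses only the JDK's SAX API, not Tika, and the class name, method name and the withNames flag are hypothetical. It assumes one output line per element and that the parser reports attributes in document order, which SAX does not strictly guarantee. Unlike the hand-written samples above, it also emits the root element's version attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: plain JDK SAX, not an existing Tika handler.
class AttributeDump {

    /**
     * Emits one line per element.
     * withNames = false: attribute values only (elements without attributes are skipped).
     * withNames = true:  element name followed by attribute name/value pairs.
     */
    static String dump(String xml, boolean withNames) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes a) {
                StringBuilder line = new StringBuilder();
                if (withNames) {
                    line.append(qName);
                }
                for (int i = 0; i < a.getLength(); i++) {
                    if (line.length() > 0) {
                        line.append(' ');
                    }
                    if (withNames) {
                        line.append(a.getQName(i)).append(' ');
                    }
                    line.append(a.getValue(i));
                }
                if (line.length() > 0) {
                    out.append(line).append('\n');
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened version of the reporter's file.
        String xml = "<spocosy version=\"1.0\"><subscription-update exec=\"0.002\">"
                + "<lineup id=\"0\" del=\"no\"/></subscription-update></spocosy>";
        System.out.print(dump(xml, false));
        System.out.print(dump(xml, true));
    }
}
```

The first form is more compact, while the second keeps enough context to tell which value belongs to which attribute.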
> Parsing XML outputs empty string
> --------------------------------
>
> Key: TIKA-2001
> URL: https://issues.apache.org/jira/browse/TIKA-2001
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11, 1.12, 1.13
> Reporter: George L. Yermulnik
> Priority: Minor
>
> Can't get Tika to parse my XML files:
> {code}
> root@spring:/tmp# java -version
> java version "1.8.0_91"
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
> root@spring:/tmp# cat /tmp/xml/5751061032fbd-7148.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <spocosy version="1.0"><subscription-update subscriptionid="0" requestid="0"
> last_push="2016-06-03 06:21:34" current_push="2016-06-03 06:21:37"
> exec="0.002"><lineup id="0" event_participantsFK="0" participantFK="0"
> lineup_typeFK="0" shirt_number="0" pos="0" enet_pos="0" n="0" ut="2016-06-03
> 06:21:37" del="no"/></subscription-update></spocosy>
> root@spring:/tmp# for i in 3 2 1; do
> echo -n "tika-app-1.1${i}.jar: "
> java -jar tika-app-1.1${i}.jar --text /tmp/xml/5751061032fbd-7148.xml
> done
> tika-app-1.13.jar:
> tika-app-1.12.jar:
> tika-app-1.11.jar:
> root@spring:/tmp#
> {code}
> I'd appreciate any help. Thanks.