[ 
https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979009#comment-15979009
 ] 

Laura commented on JENA-1325:
-----------------------------

> using Redland command line tools may help because there is no java startup 
> overhead

This is indeed what I've been doing: looking for a C command-line utility to 
avoid the Java startup overhead. I've found "rapper", which works fine but has 
one problem. It assigns sequential names to blank nodes, such as _:genId1, 
_:genId2, etc. This breaks down when parsing multiple files, because the 
counter restarts for each file, so every generated N-Triples output reuses the 
same blank node IDs, starting at _:genId1. When I `cat` all the triples into a 
single .nt file, I end up with one huge blank node carrying properties from 
all files.
Unfortunately I don't know how to fix this, because there seems to be no flag 
for generating randomized names. If you know better, I'd like to know too :)
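
One possible workaround, as a minimal sketch rather than a rapper feature, is 
to give each file's blank nodes their own prefix before concatenating. The 
Python below is illustrative only: the function names and the `f0`, `f1`, ... 
prefixes are made up here, and the label pattern is an ASCII-only 
approximation of the N-Triples blank node grammar. IRIs and literals are 
matched first and copied through unchanged, so a "_:" inside a string is not 
touched.

```python
import re

# One pattern, three alternatives. IRIs and literals are matched first so
# that they are consumed whole and never scanned for "_:" sequences.
_TOKEN = re.compile(
    r'<[^>]*>'                 # IRI: copied through unchanged
    r'|"(?:[^"\\]|\\.)*"'      # literal (with escapes): copied through unchanged
    # Blank node label, ASCII-only approximation; may not end with "."
    r'|_:([A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?)'
)


def scope_blank_nodes(line, prefix):
    """Return the N-Triples line with every blank node label prefixed."""
    def rename(match):
        label = match.group(1)
        if label is None:
            # An IRI or a literal matched: leave it exactly as it was.
            return match.group(0)
        return "_:" + prefix + "_" + label
    return _TOKEN.sub(rename, line)


def merge_ntriples(documents):
    """Yield the lines of several N-Triples documents, one after another.

    `documents` is an iterable of iterables of lines, one per source file.
    Each document gets a distinct (hypothetical) prefix: f0, f1, ...
    """
    for index, lines in enumerate(documents):
        prefix = "f%d" % index
        for line in lines:
            yield scope_blank_nodes(line, prefix)
```

Feeding the per-file rapper output through `merge_ntriples` instead of `cat` 
keeps blank nodes from different files distinct, while blank nodes within one 
file still share their label.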

RIOT would work perfectly and without the overhead, since I can pass it the 
complete list of files at once, for example using `files/*`. The problem is 
that the entire job stops/fails if a single file has errors.

> close with "won't fix" because streaming parsing is critical

I would also agree to close this as WONTFIX if the change compromised 
streaming, but I'd like to understand how it would actually do so. If I pass a 
list of files to RIOT, why can't it just parse **and stream** one file after 
the other in sequence, skipping bad files or bad triples? It would be like 
calling RIOT once per file, except that instead of starting the VM every time, 
I give it the whole list of files in one go.
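
The behaviour asked about above can be sketched roughly as follows. This is 
plain Python for illustration, not RIOT code: `parse_file` is a hypothetical 
callable standing in for a streaming parser, assumed to yield N-Triples lines 
for one file and to raise `ValueError` on the first error. Each file is parsed 
as a stream, but its output is held back until the file has parsed cleanly, so 
a bad file contributes nothing to the result.

```python
import io


def convert_all(paths, parse_file, out):
    """Parse each path in turn in one process; write output of clean files only.

    parse_file(path) is a hypothetical stand-in for a streaming parser: it
    yields N-Triples lines and raises ValueError when the input is invalid.
    Returns a list of (path, error message) for the files that were skipped.
    """
    skipped = []
    for path in paths:
        buffer = io.StringIO()          # holds the output of this file only
        try:
            for line in parse_file(path):
                buffer.write(line + "\n")
        except ValueError as err:
            # Discard the partial output and carry on with the next file.
            skipped.append((path, str(err)))
            continue
        out.write(buffer.getvalue())    # file was clean: release its triples
    return skipped
```

In this sketch, skipping a whole bad file is what requires the per-file 
buffer, and the buffer never holds more than one file's output. Skipping only 
bad triples would not need it, since each good triple could be written as soon 
as it is parsed.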

> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
>                 Key: JENA-1325
>                 URL: https://issues.apache.org/jira/browse/JENA-1325
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>         Environment: GNU/Linux
>            Reporter: Laura
>              Labels: easyfix, performance
>
> This issue is more or less related to this other one 
> https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm 
> using RIOT to validate them and dump the valid ones into ntriples files. The 
> problem is that calling RIOT on each file is not going to cut it. The 
> overhead is significant enough that this operation is just too slow (hours). 
> So I've tried to call RIOT only once on all files together using
> {noformat}
>     riot \
>         --verbose \
>         --stop \
>         --check \
>         --strict \
>         --output=nt \
>         files/*.rdf > files.nt
> {noformat}
> and in this way validation is much faster. The problem is that it's still 
> dumping invalid files to the .nt output file. I'm downloading these files 
> from the Internet, so I'm not going to fix them myself; I just want to skip 
> bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad 
> data, and I'm not asking for this. I'm suggesting however to add an 
> *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same 
> RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise 
> any ERROR
> I think this is well in the scope of RIOT functionalities. Could this option 
> please be added to RIOT?
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
