[ 
https://issues.apache.org/jira/browse/JENA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975503#comment-15975503
 ] 

Andy Seaborne edited comment on JENA-1325 at 4/19/17 9:26 PM:
--------------------------------------------------------------

[~rvesse] captures the issue perfectly. Command line tools cannot cover all 
possible uses.

The "riot" command can handle very large files. This is important.

What the OP is asking for is a two-pass algorithm (parse the whole file into a 
graph, output if and only if there are no errors), which can be achieved with a 
small program that reads the files one at a time (no start-up overhead), or 
with some parallelism where different files go to different threads.
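
As a rough sketch of that small program (not riot's actual code): each file is 
parsed completely into memory first, and its triples are written only if the 
parse raised no error. The FileParser interface is a hypothetical stand-in for 
the real RIOT parsing call, and the class and method names are made up for 
illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ValidOnlyConverter {

    /**
     * Hypothetical stand-in for the real parser: returns the file's triples
     * as N-Triples lines, or throws if the file contains any error.
     */
    interface FileParser {
        List<String> parse(Path file) throws Exception;
    }

    /**
     * Parses each file completely, then appends its triples to out only if
     * parsing raised no error. Returns the files that were skipped.
     */
    static List<Path> convertAll(List<Path> files, FileParser parser, Appendable out)
            throws IOException {
        List<Path> skipped = new ArrayList<>();
        for (Path file : files) {
            List<String> triples;
            try {
                triples = parser.parse(file);   // pass 1: parse the whole file into memory
            } catch (Exception e) {
                skipped.add(file);              // bad file: none of it reaches the output
                continue;
            }
            for (String line : triples) {       // pass 2: emit the buffered triples
                out.append(line).append('\n');
            }
        }
        return skipped;
    }
}
```

In a real program the parse step would presumably read the file into a Jena 
graph and serialize it as N-Triples. Since everything runs in one JVM, the 
per-file cost is just the parsing.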

Or call "riot.main" in a loop in a Java program to avoid the JVM start-up 
overhead.

Since riot is open source, its code can be used as inspiration or as a basis 
for your own program.

Note: Running the riot command in parallel, outputting to separate files, may 
well help, depending on the hardware being used.





> RIOT parse many files at once, output only valid ones
> -----------------------------------------------------
>
>                 Key: JENA-1325
>                 URL: https://issues.apache.org/jira/browse/JENA-1325
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>         Environment: GNU/Linux
>            Reporter: Laura
>              Labels: easyfix, performance
>
> This issue is more or less related to this other one 
> https://issues.apache.org/jira/browse/JENA-1322
> I have a folder with thousands of files, mostly small RDF/XML files. I'm 
> using RIOT to validate them and dump the valid ones into ntriples files. The 
> problem is that calling RIOT on each file is not going to cut it. The 
> overhead is significant enough that this operation is just too slow (hours). 
> So I've tried to call RIOT only once on all files together using
> {noformat}
>     riot \
>         --verbose \
>         --stop \
>         --check \
>         --strict \
>         --output=nt \
>         files/*.rdf > files.nt
> {noformat}
> and in this way validation is much faster. The problem is that it's still 
> dumping invalid files to the .nt output file. I'm downloading these files 
> from the Internet, so I'm not going to fix them myself; I just want to skip 
> bad files.
> Now, to be clear, I understand that RIOT is of course not meant to fix bad 
> data, and I'm not asking for this. I'm suggesting however to add an 
> *--option* such that RIOT can do the following:
> 1. parse multiple files at once (so that there is no need to invoke the same 
> RIOT command for each file)
> 2. for every file, check/validate it
> 3. if *--output* is set, only output those files or triples that didn't raise 
> any ERROR
> I think this is well in the scope of RIOT functionalities. Could this option 
> please be added to RIOT?
> Thank you.



