[ 
https://issues.apache.org/jira/browse/JENA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318462#comment-17318462
 ] 

Andy Seaborne commented on JENA-2083:
-------------------------------------

Hi there - how much data is there (in triples)?

> some of the files are incorrectly serialized,

What sort of errors? Some errors are limited to the current triple but some 
errors would remove several following triples in order to do some kind of 
recovery.

> It is not feasible right now to sort out the defective files from the good 
> ones before running tdbloader.

The best approach is to validate the files first by running them through 
{{riot}}. Parsing is faster than loading and, with separate files, it can be 
done in parallel.
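
As a rough sketch of that validation step (not a definitive script): the 
Python below runs one {{riot --validate}} process per file on a small thread 
pool and splits the files into good and bad lists. It assumes {{riot}} is on 
the PATH and that it exits with a non-zero status when a file has errors; 
please check that against your Jena version. The function names, the 
{{command}} parameter and the worker count are illustrative only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def validate_one(path, command=("riot", "--validate")):
    """Run the validator on one file; return (path, ok).

    Assumption: the validator exits with status 0 only when the file parses
    cleanly. Its output is discarded here to keep the sketch short.
    """
    result = subprocess.run(
        [*command, str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return path, result.returncode == 0


def partition_files(paths, command=("riot", "--validate"), workers=4):
    """Validate files in parallel and return (good, bad) lists in input order."""
    good, bad = [], []
    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, ok in pool.map(lambda p: validate_one(p, command), paths):
            (good if ok else bad).append(path)
    return good, bad
```

The good list can then be handed to the loader in one run, and the bad list 
looked at separately, e.g. by re-running {{riot --validate}} on those files 
by hand to see the messages.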

With any kind of recovery, the end result is that you don't know what data has 
actually been loaded and so when using the data, there will be unpredictable 
problems (e.g. queries mysteriously not matching). Such problems are painful 
and expensive to diagnose and fix. As a general rule, bad data in a database is 
hard and time-consuming to fix.

N-Triples is line-based, so it can be processed by text handling tools, which 
is useful for patching up systematic errors.
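
To illustrate the kind of text-level patching meant here, a minimal sketch 
in Python. The systematic error it fixes (raw spaces inside IRIs, rewritten 
as {{%20}}) is a made-up example, since the actual errors in these files 
have not been described; the function name is hypothetical too. It tracks 
whether it is inside an IRI or a string literal so that literal text is left 
alone.

```python
def escape_spaces_in_iris(line: str) -> str:
    """Replace raw spaces inside <...> IRIs with %20 on one N-Triples line.

    Hypothetical fix for one kind of systematic error. String literals and
    comment lines are passed through unchanged.
    """
    if line.lstrip().startswith("#"):
        return line  # comment line: nothing to patch

    out = []
    in_iri = False
    in_literal = False
    escaped = False  # previous character in a literal was a backslash
    for ch in line:
        if in_literal:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_literal = False
        elif in_iri:
            if ch == ">":
                in_iri = False
                out.append(ch)
            elif ch == " ":
                out.append("%20")
            else:
                out.append(ch)
        else:
            if ch == "<":
                in_iri = True
            elif ch == '"':
                in_literal = True
            out.append(ch)
    return "".join(out)
```

Applied line by line to a file, this writes a patched copy that should then 
be re-validated before loading.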

tdb2.tdbloader provides a number of algorithms. Some work by manipulating the 
transaction system and some work with parallelism, both of which make error 
recovery hard and have an impact on performance.

The basic algorithm is transactional (as is loading into Fuseki with a TDB2 
database). A file, or set of files, will either load completely or not load at 
all, in which case the database is left in the state it was in before.



> Support skipping/ignoring errors with tdbloader
> -----------------------------------------------
>
>                 Key: JENA-2083
>                 URL: https://issues.apache.org/jira/browse/JENA-2083
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: TDB, TDB2
>            Reporter: Timothy Higinbottom
>            Priority: Major
>
> Hi all,
> I have a fairly large (~22,000) number of N-Triples files I hope to import 
> into TDB2 to query with Fuseki.
> I boosted the RAM allotted to the JVM and used the parallel mode from 
> tdb2.tdbloader. This whizzed through the first 1,000 of the files.
> However, some of the files are incorrectly serialized, so they caused errors 
> when Jena tried to read them. It is not feasible right now to sort out the 
> defective files from the good ones before running tdbloader.
> It would be great if tdbloader could add an option to skip the files that 
> error so that it can continue to process the other files.
> The main reason this should be part of tdbloader itself is that the 
> alternative (running xargs or a loop in Bash) decreases performance because 
> then the loading is effectively synchronous and the user can't take advantage 
> of the tdbloader modes and batching.
> Thanks for this great project!



