[ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221074#comment-13221074 ]

Paolo Castagna commented on JENA-117:
-------------------------------------

Hi Sarven, here are a few answers to your questions.

> --compression Use compression for intermediate files 
> --gzip-outside GZIP...(Buffered...()) 
> --buffer-size The size of buffers for IO in bytes 
> --no-buffer Do not use Buffered{Input|Output}Stream

Those are all options that control the DataOutputStream/DataInputStream wrappers 
used during processing. In DataStreamFactory.java you can find this:

  if ( ! buffered ) {
      return new DataOutputStream( compression ? new GZIPOutputStream(out) : out ) ;
  } else {
      if ( gzip_outside ) {
          return new DataOutputStream( compression
              ? new GZIPOutputStream(new BufferedOutputStream(out, buffer_size))
              : new BufferedOutputStream(out, buffer_size) ) ;
      } else {
          return new DataOutputStream( compression
              ? new BufferedOutputStream(new GZIPOutputStream(out, buffer_size))
              : new BufferedOutputStream(out, buffer_size) ) ;
      }
  }

This is me experimenting to find the best combination. I still do not have a 
definitive answer; I welcome suggestions and results from experiments. That is 
the reason why I put those configuration parameters on the command line. 
Ideally, once we find what works best, we should use that as the default and 
either eliminate the parameters or keep them for advanced users only. The 
buffer size is 8192 bytes by default. 
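If you want to experiment as well, a quick-and-dirty sketch like the following 
(class name, file names and payload are invented for illustration; this is not 
tdbloader3 code) can be used to time the two wrapping orders with compression on:

  import java.io.*;
  import java.util.zip.GZIPOutputStream;

  // Rough comparison of the two stream-wrapping orders discussed above.
  public class WrapOrderExperiment {
      static long timeMillis(OutputStream out) throws IOException {
          long start = System.nanoTime();
          try ( DataOutputStream dos = new DataOutputStream(out) ) {
              for ( long i = 0; i < 10000000L; i++ )
                  dos.writeLong(i % 1000); // somewhat compressible payload
          }
          return (System.nanoTime() - start) / 1000000;
      }

      public static void main(String[] args) throws IOException {
          int bufferSize = 8192; // the default mentioned above
          // gzip outside: GZIP wraps the buffer
          System.out.println("gzip outside: " + timeMillis(new GZIPOutputStream(
              new BufferedOutputStream(new FileOutputStream("outside.dat.gz"), bufferSize))) + " ms");
          // gzip inside: the buffering is GZIPOutputStream's own internal buffer
          System.out.println("gzip inside:  " + timeMillis(new BufferedOutputStream(
              new GZIPOutputStream(new FileOutputStream("inside.dat.gz"), bufferSize)))+ " ms");
      }
  }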

> --spill-size The size of spillable segments in tuples|records
> --spill-size-auto Automatically set the size of spillable segments 
> --max-merge-files Specify the maximum number of files to merge at the same 
> time (default: 100) 

These are two more advanced-user-only parameters, there to allow experiments and 
find out what works best. tdbloader3 uses 'data bags' which spill data to disk, 
because we cannot assume the data at any stage fits into RAM and we want to 
avoid disk seeks. So, for example, if we want to sort data which does not fit 
in RAM, we sort it in RAM in chunks, dump each sorted chunk to disk, process 
another chunk, etc.; at the end we sort-merge all the chunks. The --spill-size 
parameter controls how many tuples are kept in RAM before spilling to disk. A 
good value is not easy to know: it depends on how many bytes each tuple takes, 
and tuples have different sizes at different stages of the computation. 
Ideally, users should not even think about this. That is why I tried to have an 
adaptive strategy (i.e. --spill-size-auto). With --spill-size-auto, tdbloader3 
constantly monitors the RAM available in the JVM and triggers the spill to disk 
when the available RAM approaches a certain threshold. Things are more 
complicated if you have multiple threads, and I am still unsure whether this is 
a good strategy or not. The aim is to have autotuning on by default so that 
users do not have to think about spill sizes (see also: JENA-126 and JENA-157). 
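The core of the adaptive check is, roughly, something like this sketch (class 
name, method name and threshold are mine for illustration, not the actual 
tdbloader3 code):

  // Sketch: spill when the JVM's headroom falls below a fraction of -Xmx.
  public class SpillTrigger {
      private final double minFreeFraction; // e.g. 0.2 = spill below 20% headroom

      public SpillTrigger(double minFreeFraction) {
          this.minFreeFraction = minFreeFraction;
      }

      public boolean shouldSpill() {
          Runtime rt = Runtime.getRuntime();
          long max  = rt.maxMemory();                      // the -Xmx ceiling
          long used = rt.totalMemory() - rt.freeMemory();  // memory currently in use
          return (double)(max - used) / max < minFreeFraction;
      }
  }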
--max-merge-files specifies the maximum number of files/chunks to sort-merge 
after each chunk has been sorted and spilled to disk. So, for example, if you 
end up with 10,000 temporary files, the sort-merge happens in two rounds: the 
first round generates 100 new files (sort-merging 100 files at a time), and 
then a last round sort-merges those 100 newly generated files. This is because 
reading from too many files at the same time does not work well. Why 100? The 
Hadoop source code says they found 100 works best for them when doing a very 
similar thing. This is another area where more experiments will help in 
finding a reasonable default value. 
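To make the rounds concrete, here is a small sketch of the multi-round merge 
(in-memory lists of longs stand in for the files of serialized tuples; all 
names are mine for illustration):

  import java.util.*;

  // Sketch of the multi-round k-way merge behind --max-merge-files.
  public class MultiRoundMerge {

      // One round: merge at most maxMergeFiles sorted runs at a time.
      static List<List<Long>> mergeRound(List<List<Long>> runs, int maxMergeFiles) {
          List<List<Long>> next = new ArrayList<List<Long>>();
          for ( int i = 0; i < runs.size(); i += maxMergeFiles )
              next.add(kWayMerge(runs.subList(i, Math.min(i + maxMergeFiles, runs.size()))));
          return next;
      }

      // Classic k-way merge via a priority queue of { value, run, position }.
      static List<Long> kWayMerge(List<List<Long>> runs) {
          List<Long> out = new ArrayList<Long>();
          PriorityQueue<long[]> pq =
              new PriorityQueue<long[]>(Comparator.comparingLong((long[] e) -> e[0]));
          for ( int r = 0; r < runs.size(); r++ )
              if ( ! runs.get(r).isEmpty() )
                  pq.add(new long[] { runs.get(r).get(0), r, 0 });
          while ( ! pq.isEmpty() ) {
              long[] e = pq.poll();
              out.add(e[0]);
              int r = (int) e[1], pos = (int) e[2] + 1;
              if ( pos < runs.get(r).size() )
                  pq.add(new long[] { runs.get(r).get(pos), r, pos });
          }
          return out;
      }

      public static void main(String[] args) {
          List<List<Long>> runs = new ArrayList<List<Long>>();
          runs.add(Arrays.asList(1L, 4L, 9L));
          runs.add(Arrays.asList(2L, 3L, 8L));
          runs.add(Arrays.asList(5L, 6L, 7L));
          while ( runs.size() > 1 )
              runs = mergeRound(runs, 2);  // merge at most 2 runs per round
          System.out.println(runs.get(0)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
      }
  }

With --max-merge-files 100, N initial chunks need ceil(log100(N)) such rounds, 
which is why anything up to 10,000 chunks is done in two.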

> --no-stats Do not generate the stats file 

This is easy: by default tdbloader3 generates the stats.opt file (see the 
"Choosing the optimizer strategy" section here: 
http://incubator.apache.org/jena/documentation/tdb/optimizer.html). You can 
ignore that option; the stats.opt file can also be generated later via TDB's 
tdbstats command line tool.
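(If I remember correctly, that is something along the lines of 
tdbstats --loc=/path/to/DB > stats.opt, with the generated stats.opt then 
placed in the database directory; check the optimizer page above for the exact 
invocation.)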

Now, your errors:

> $ java -cp 
> target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar 
> -server -d64 -Xmx2000M cmd.tdbloader3 --no-stats --compression --spill-size 
> 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/indicators.tar.gz
> INFO Load: /tmp/indicators.tar.gz -- 2012/03/02 10:49:39 EST
> ERROR [line: 1, col: 13] Unknown char: (0) 

I think this is because you are trying to load a .gz which contains a tar 
archive with multiple files. tdbloader3 does not support that.
My advice is to convert and validate all your files from whatever format you 
have into N-Triples or N-Quads. 
Concatenate all the N-Triples or N-Quads files into a single .nt or .nq file 
and gzip it, so that you end up with a single filename.nt.gz (which contains a 
single file).
Try loading that using tdbloader2 on a 64-bit machine with as much RAM as you 
have, and use -Xmx2048m for the JVM.
If you try tdbloader3 as well on the same machine, give the JVM as much RAM as 
you can via -Xmx..., since tdbloader3 does not use memory mapped files.

> $ java tdb.tdbquery --desc=/usr/lib/fuseki/tdb2.worldbank.ttl 'SELECT * WHERE 
> { ?s ?p ?o . } LIMIT 100'
> 10:56:30 WARN ModTDBDataset :: Unexpected: Not a TDB dataset for type 
> DatasetTDB 

Please double-check that your tdb2.worldbank.ttl is pointing at the right directory.

> One final thing I'd like to know how to do is assigning graph names. --graph 
> is not available as it was in tdbloader. 

Right. One way to work around this would be to use files in the N-Quads format 
(http://sw.deri.org/2008/07/n-quads/) instead of N-Triples. 
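In N-Quads the graph name is the fourth element of each statement, e.g. (URIs 
invented for illustration):

  <http://example/s> <http://example/p> <http://example/o> <http://example/graph1> .

Each quad then goes into the named graph given by that fourth element.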

I have worked on tdbloader3 only "out-of-band", but things might change (if 
people are interested).
You are not the only one who needs some patience when dealing with datasets of 
more than 500 million triples. :-)
One dataset I want to experiment with is Freebase (i.e. ~ 600 million triples) 
and I have only 8 GB of RAM on my desktop. This certainly is a good experiment 
for tdbloader3.
                
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in 
> replacing the UNIX sort over text files with an external sorting pure Java 
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream which 
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is 
> trivial.
> Preliminary results seem promising and show that the Java implementation can 
> be faster than UNIX sort, since it uses smaller binary files (instead of text 
> files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag vs. UNIX sort, is 
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort 
> is that we could stream results directly into the BPlusTreeRewriter, rather 
> than writing them to disk and then reading them back from disk into the 
> BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant 
> improvement.
> Using compression for intermediate files might help, but more experiments are 
> necessary to establish if it is worthwhile or not.
