Repository: incubator-rya
Updated Branches:
  refs/heads/master 3c3ab0dfd -> 5463da23c


Make RdfFileInputTool accept multiple input paths. Doc improvements


Project: http://git-wip-us.apache.org/repos/asf/incubator-rya/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-rya/commit/639b980c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-rya/tree/639b980c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-rya/diff/639b980c

Branch: refs/heads/master
Commit: 639b980ce80677ec4703ba39e19cfd9e7943c506
Parents: 3c3ab0d
Author: Maxim Kolchin <kolchin...@gmail.com>
Authored: Wed Jul 4 13:04:30 2018 +0300
Committer: Maxim Kolchin <kolchin...@gmail.com>
Committed: Wed Jul 4 13:04:30 2018 +0300

----------------------------------------------------------------------
 extras/rya.manual/src/site/markdown/loaddata.md | 48 +++++++++++++----
 .../rya.manual/src/site/markdown/quickstart.md  |  4 +-
 mapreduce/pom.xml                               | 56 ++++++++++----------
 .../rya/accumulo/mr/AbstractAccumuloMRTool.java |  6 +--
 .../rya/accumulo/mr/tools/RdfFileInputTool.java |  2 +-
 .../accumulo/mr/tools/RdfFileInputToolTest.java | 40 +++++++++++---
 mapreduce/src/test/resources/test2.ntriples     |  3 ++
 7 files changed, 108 insertions(+), 51 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/extras/rya.manual/src/site/markdown/loaddata.md
----------------------------------------------------------------------
diff --git a/extras/rya.manual/src/site/markdown/loaddata.md b/extras/rya.manual/src/site/markdown/loaddata.md
index e5c7bd2..9d43edd 100644
--- a/extras/rya.manual/src/site/markdown/loaddata.md
+++ b/extras/rya.manual/src/site/markdown/loaddata.md
@@ -21,7 +21,7 @@
 -->
 # Load Data
 
-There are a few mechanisms to load data
+There are a few mechanisms to load data.
 
 ## Web REST endpoint
 
@@ -92,29 +92,55 @@ The default "format" is RDF/XML, but these formats are supported : RDFXML, NTRIP
 
 ## Bulk Loading data
 
-Bulk loading data is done through Map Reduce jobs
+Bulk loading data is done through Map Reduce jobs.
 
 ### Bulk Load RDF data
 
-This Map Reduce job will read files into memory and parse them into statements. The statements are saved into the store. Here is an example for storing in Accumulo:
+This Map Reduce job will read files into memory and parse them into statements. The statements are saved into the triplestore.
+Here are the steps to prepare and run the job:
+
+  * Load the RDF data to HDFS. It can be in a single or multiple volumes and directories in them.
+  * Also load the `mapreduce/target/rya.mapreduce-<version>-shaded.jar` executable jar file to HDFS.
+  * Run the following sample command:
 
 ```
-hadoop jar target/rya.mapreduce-3.2.10-SNAPSHOT-shaded.jar org.apache.rya.accumulo.mr.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=triplestore_ -Drdf.format=N-Triples /tmp/temp.ntrips
+hadoop jar hdfs://volume/rya.mapreduce-<version>-shaded.jar org.apache.rya.accumulo.mr.tools.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=triplestore_ -Drdf.format=N-Triples hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt
 ```
 
 Options:
 
-- rdf.tablePrefix : The tables (spo, po, osp) are prefixed with this qualifier. The tables become: (rdf.tablePrefix)spo,(rdf.tablePrefix)po,(rdf.tablePrefix)osp
-- ac.* : Accumulo connection parameters
-- rdf.format : See RDFFormat from RDF4J, samples include (Trig, N-Triples, RDF/XML)
-- sc.use_freetext, sc.use_geo, sc.use_temporal, sc.use_entity : If any of these are set to true, statements will also be
+- **rdf.tablePrefix** - The tables (spo, po, osp) are prefixed with this qualifier.
+    The tables become: (rdf.tablePrefix)spo,(rdf.tablePrefix)po,(rdf.tablePrefix)osp
+- **ac.*** - Accumulo connection parameters
+- **rdf.format** - See RDFFormat from RDF4J, samples include (Trig, N-Triples, RDF/XML)
+- **sc.use_freetext, sc.use_geo, sc.use_temporal, sc.use_entity** - If any of these are set to true, statements will also be
     added to the enabled secondary indices.
-- sc.freetext.predicates, sc.geo.predicates, sc.temporal.predicates: If the associated indexer is enabled, these options specify
+- **sc.freetext.predicates, sc.geo.predicates, sc.temporal.predicates** - If the associated indexer is enabled, these options specify
     which statements should be sent to that indexer (based on the predicate). If not given, all indexers will attempt to index
     all statements.
 
-The argument is the directory/file to load. This file needs to be loaded into HDFS before running. If loading a directory, all files should have the same RDF
-format.
+The positional argument is a comma separated list of directories/files to load.
+They need to be loaded into HDFS before running. If loading a directory,
+all files should have the same RDF format.
+
+Once the data is loaded, it is actually a good practice to compact your tables.
+You can do this by opening the Accumulo shell and running the compact
+command on the generated tables. Remember the generated tables will be
+prefixed by the rdf.tablePrefix property you assigned above.
+The default tablePrefix is `rts`.
+Here is a sample Accumulo Shell command:
+
+```
+compact -p triplestore_(.*)
+```
+
+### Generate Prospects table
+
+For the best query performance, it is recommended to run the job that
+creates the Prospects table. This job will read through your data and
+gather statistics on the distribution of the dataset. This table is then
+queried before query execution to reorder queries based on the data
+distribution. See the [Prospects Table](eval.md) section on how to do this.
 
 ## Direct RDF4J API
 

http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/extras/rya.manual/src/site/markdown/quickstart.md
----------------------------------------------------------------------
diff --git a/extras/rya.manual/src/site/markdown/quickstart.md b/extras/rya.manual/src/site/markdown/quickstart.md
index f0d76a8..7a93cda 100644
--- a/extras/rya.manual/src/site/markdown/quickstart.md
+++ b/extras/rya.manual/src/site/markdown/quickstart.md
@@ -56,7 +56,7 @@ Start the Tomcat server. `./bin/startup.sh`
 
 ## Usage
 
-First, we need to load data. See the [Load Data Section] (loaddata.md)
+First, we need to load data. See the [Load Data](loaddata.md) section.
 
-Second, we need to query that data. See the [Query Data Section](querydata.md)
+Second, we need to query that data. See the [Query Data](querydata.md) section.
 

http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/mapreduce/pom.xml
----------------------------------------------------------------------
diff --git a/mapreduce/pom.xml b/mapreduce/pom.xml
index dc3cec4..bc019da 100644
--- a/mapreduce/pom.xml
+++ b/mapreduce/pom.xml
@@ -88,6 +88,35 @@ under the License.
     </dependencies>
 
     <build>
+        <plugins>
+            <plugin>
+                <groupId>org.apache.maven.plugins</groupId>
+                <artifactId>maven-shade-plugin</artifactId>
+                <executions>
+                    <execution>
+                        <phase>package</phase>
+                        <goals>
+                            <goal>shade</goal>
+                        </goals>
+                        <configuration>
+                            <filters>
+                                <filter>
+                                    <artifact>*:*</artifact>
+                                    <excludes>
+                                        <exclude>META-INF/*.SF</exclude>
+                                        <exclude>META-INF/*.DSA</exclude>
+                                        <exclude>META-INF/*.RSA</exclude>
+                                    </excludes>
+                                </filter>
+                            </filters>
+                            <transformers>
+                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
+                            </transformers>
+                        </configuration>
+                    </execution>
+                </executions>
+            </plugin>
+        </plugins>
         <pluginManagement>
             <plugins>
                 <plugin>
@@ -101,33 +130,6 @@ under the License.
                         </excludes>
                     </configuration>
                 </plugin>
-                <plugin>
-                    <groupId>org.apache.maven.plugins</groupId>
-                    <artifactId>maven-shade-plugin</artifactId>
-                    <executions>
-                        <execution>
-                            <phase>package</phase>
-                            <goals>
-                                <goal>shade</goal>
-                            </goals>
-                            <configuration>
-                                <filters>
-                                    <filter>
-                                        <artifact>*:*</artifact>
-                                        <excludes>
-                                            <exclude>META-INF/*.SF</exclude>
-                                            <exclude>META-INF/*.DSA</exclude>
-                                            <exclude>META-INF/*.RSA</exclude>
-                                        </excludes>
-                                    </filter>
-                                </filters>
-                                <transformers>
-                                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
-                                </transformers>
-                            </configuration>
-                        </execution>
-                    </executions>
-                </plugin>
             </plugins>
         </pluginManagement>
     </build>

http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/mapreduce/src/main/java/org/apache/rya/accumulo/mr/AbstractAccumuloMRTool.java
----------------------------------------------------------------------
diff --git a/mapreduce/src/main/java/org/apache/rya/accumulo/mr/AbstractAccumuloMRTool.java b/mapreduce/src/main/java/org/apache/rya/accumulo/mr/AbstractAccumuloMRTool.java
index 7489391..cd29e1e 100644
--- a/mapreduce/src/main/java/org/apache/rya/accumulo/mr/AbstractAccumuloMRTool.java
+++ b/mapreduce/src/main/java/org/apache/rya/accumulo/mr/AbstractAccumuloMRTool.java
@@ -209,18 +209,18 @@ public abstract class AbstractAccumuloMRTool implements Tool {
      * ({@link org.apache.hadoop.io.LongWritable}, {@link RyaStatementWritable})
      * pairs from RDF file(s) found at the specified path.
      * @param   job   Job to configure
-     * @param   inputPath     File or directory name
+     * @param   commaSeparatedPaths a comma separated list of files or directories
      * @param   defaultFormat  Default RDF serialization format, can be
      *                         overridden by {@link MRUtils#FORMAT_PROP}
      * @throws  IOException if there's an error interacting with the
      *          {@link org.apache.hadoop.fs.FileSystem}.
      */
-    protected void setupFileInput(Job job, String inputPath, RDFFormat defaultFormat) throws IOException {
+    protected void setupFileInputs(Job job, String commaSeparatedPaths, RDFFormat defaultFormat) throws IOException {
         RDFFormat format = MRUtils.getRDFFormat(conf);
         if (format == null) {
             format = defaultFormat;
         }
-        RdfFileInputFormat.addInputPath(job, new Path(inputPath));
+        RdfFileInputFormat.addInputPaths(job, commaSeparatedPaths);
         RdfFileInputFormat.setRDFFormat(job, format);
         job.setInputFormatClass(RdfFileInputFormat.class);
     }
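
As an illustration of the comma-separated convention that `setupFileInputs` now accepts, here is a minimal sketch in plain Java. It is a simplified stand-in and not Hadoop's implementation: the real `addInputPaths` call also gives special treatment to commas inside glob braces, which this sketch ignores. The class name `CommaSeparatedPaths` and the method `split` are hypothetical and do not exist in Rya or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class CommaSeparatedPaths {

    // Splits a positional argument such as "dir1,dir2,file1.nt" into
    // individual path strings, trimming whitespace and skipping empty entries.
    static List<String> split(String commaSeparatedPaths) {
        List<String> paths = new ArrayList<>();
        for (String part : commaSeparatedPaths.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Same shape as the sample argument in loaddata.md.
        String arg = "hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt";
        for (String path : split(arg)) {
            System.out.println(path);
        }
    }
}
```

Each resulting entry may be a file or a directory, which matches the updated Javadoc for `commaSeparatedPaths`.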

http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/mapreduce/src/main/java/org/apache/rya/accumulo/mr/tools/RdfFileInputTool.java
----------------------------------------------------------------------
diff --git a/mapreduce/src/main/java/org/apache/rya/accumulo/mr/tools/RdfFileInputTool.java b/mapreduce/src/main/java/org/apache/rya/accumulo/mr/tools/RdfFileInputTool.java
index c004f4e..5d7209a 100644
--- a/mapreduce/src/main/java/org/apache/rya/accumulo/mr/tools/RdfFileInputTool.java
+++ b/mapreduce/src/main/java/org/apache/rya/accumulo/mr/tools/RdfFileInputTool.java
@@ -65,7 +65,7 @@ public class RdfFileInputTool extends AbstractAccumuloMRTool implements Tool {
         job.setJarByClass(RdfFileInputTool.class);
 
         String inputPath = conf.get(MRUtils.INPUT_PATH, args[0]);
-        setupFileInput(job, inputPath, RDFFormat.RDFXML);
+        setupFileInputs(job, inputPath, RDFFormat.RDFXML);
         setupRyaOutput(job);
         job.setNumReduceTasks(0);
 

http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/mapreduce/src/test/java/org/apache/rya/accumulo/mr/tools/RdfFileInputToolTest.java
----------------------------------------------------------------------
diff --git a/mapreduce/src/test/java/org/apache/rya/accumulo/mr/tools/RdfFileInputToolTest.java b/mapreduce/src/test/java/org/apache/rya/accumulo/mr/tools/RdfFileInputToolTest.java
index 8f92cf1..020122b 100644
--- a/mapreduce/src/test/java/org/apache/rya/accumulo/mr/tools/RdfFileInputToolTest.java
+++ b/mapreduce/src/test/java/org/apache/rya/accumulo/mr/tools/RdfFileInputToolTest.java
@@ -19,7 +19,6 @@ package org.apache.rya.accumulo.mr.tools;
  * under the License.
  */
 
-import junit.framework.TestCase;
 import org.apache.accumulo.core.client.Connector;
 import org.apache.accumulo.core.client.admin.SecurityOperations;
 import org.apache.accumulo.core.client.mock.MockInstance;
@@ -29,10 +28,12 @@ import org.apache.accumulo.core.security.TablePermission;
 import org.apache.rya.accumulo.AccumuloRdfConfiguration;
 import org.apache.rya.accumulo.mr.TestUtils;
 import org.apache.rya.api.RdfCloudTripleStoreConstants;
+import org.apache.rya.api.domain.RyaIRI;
 import org.apache.rya.api.domain.RyaStatement;
 import org.apache.rya.api.domain.RyaType;
-import org.apache.rya.api.domain.RyaIRI;
 import org.eclipse.rdf4j.rio.RDFFormat;
+import org.junit.After;
+import org.junit.Before;
 import org.junit.Test;
 
 /**
@@ -41,7 +42,7 @@ import org.junit.Test;
  * Time: 10:51 AM
  * To change this template use File | Settings | File Templates.
  */
-public class RdfFileInputToolTest extends TestCase {
+public class RdfFileInputToolTest {
 
     private String user = "user";
     private String pwd = "pwd";
@@ -50,9 +51,8 @@ public class RdfFileInputToolTest extends TestCase {
     private Authorizations auths = new Authorizations("test_auths");
     private Connector connector;
 
-    @Override
+    @Before
     public void setUp() throws Exception {
-        super.setUp();
         connector = new MockInstance(instance).getConnector(user, new PasswordToken(pwd));
         connector.tableOperations().create(tablePrefix + RdfCloudTripleStoreConstants.TBL_SPO_SUFFIX);
         connector.tableOperations().create(tablePrefix + RdfCloudTripleStoreConstants.TBL_PO_SUFFIX);
@@ -70,9 +70,8 @@ public class RdfFileInputToolTest extends TestCase {
         secOps.grantTablePermission(user, tablePrefix + RdfCloudTripleStoreConstants.TBL_EVAL_SUFFIX, TablePermission.WRITE);
     }
 
-    @Override
+    @After
     public void tearDown() throws Exception {
-        super.tearDown();
         connector.tableOperations().delete(tablePrefix + RdfCloudTripleStoreConstants.TBL_SPO_SUFFIX);
         connector.tableOperations().delete(tablePrefix + RdfCloudTripleStoreConstants.TBL_PO_SUFFIX);
         connector.tableOperations().delete(tablePrefix + RdfCloudTripleStoreConstants.TBL_OSP_SUFFIX);
@@ -104,6 +103,33 @@ public class RdfFileInputToolTest extends TestCase {
     }
 
     @Test
+    public void testMultipleNTriplesInputs() throws Exception {
+        RdfFileInputTool.main(new String[]{
+                "-Dac.mock=true",
+                "-Dac.instance=" + instance,
+                "-Dac.username=" + user,
+                "-Dac.pwd=" + pwd,
+                "-Dac.auth=" + auths.toString(),
+                "-Dac.cv=" + auths.toString(),
+                "-Drdf.tablePrefix=" + tablePrefix,
+                "-Drdf.format=" + RDFFormat.NTRIPLES.getName(),
+                "src/test/resources/test.ntriples,src/test/resources/test2.ntriples",
+        });
+        RyaStatement rs1 = new RyaStatement(new RyaIRI("urn:lubm:rdfts#GraduateStudent01"),
+                new RyaIRI("urn:lubm:rdfts#hasFriend"),
+                new RyaIRI("urn:lubm:rdfts#GraduateStudent02"));
+        RyaStatement rs2 = new RyaStatement(new RyaIRI("urn:lubm:rdfts#GraduateStudent05"),
+                new RyaIRI("urn:lubm:rdfts#hasFriend"),
+                new RyaIRI("urn:lubm:rdfts#GraduateStudent07"));
+        rs1.setColumnVisibility(auths.toString().getBytes());
+        rs2.setColumnVisibility(auths.toString().getBytes());
+        AccumuloRdfConfiguration conf = new AccumuloRdfConfiguration();
+        conf.setTablePrefix(tablePrefix);
+        conf.setAuths(auths.toString());
+        TestUtils.verify(connector, conf, rs1, rs2);
+    }
+
+    @Test
     public void testInputContext() throws Exception {
         RdfFileInputTool.main(new String[]{
                 "-Dac.mock=true",

http://git-wip-us.apache.org/repos/asf/incubator-rya/blob/639b980c/mapreduce/src/test/resources/test2.ntriples
----------------------------------------------------------------------
diff --git a/mapreduce/src/test/resources/test2.ntriples b/mapreduce/src/test/resources/test2.ntriples
new file mode 100644
index 0000000..692f66a
--- /dev/null
+++ b/mapreduce/src/test/resources/test2.ntriples
@@ -0,0 +1,3 @@
+<urn:lubm:rdfts#GraduateStudent05> <urn:lubm:rdfts#hasFriend> <urn:lubm:rdfts#GraduateStudent07> .
+<urn:lubm:rdfts#GraduateStudent06> <urn:lubm:rdfts#hasFriend> <urn:lubm:rdfts#GraduateStudent06> .
+<urn:lubm:rdfts#GraduateStudent07> <urn:lubm:rdfts#hasFriend> <urn:lubm:rdfts#GraduateStudent05> .
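
Each line of the new fixture holds one triple made of three IRIs followed by a terminating dot. The sketch below shows, under simplifying assumptions, how such a line breaks down into subject, predicate, and object. It only handles IRI-only triples like the ones in `test2.ntriples` and ignores literals, blank nodes, comments, and escapes, so real loads should rely on a full N-Triples parser. The class name `SimpleTripleLine` and its `parse` method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not a complete N-Triples parser.
public class SimpleTripleLine {

    // Three <...> terms separated by whitespace, then a terminating dot.
    private static final Pattern IRI_TRIPLE =
            Pattern.compile("^\\s*<([^>]*)>\\s+<([^>]*)>\\s+<([^>]*)>\\s*\\.\\s*$");

    // Returns {subject, predicate, object}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = IRI_TRIPLE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        String line = "<urn:lubm:rdfts#GraduateStudent05> "
                + "<urn:lubm:rdfts#hasFriend> "
                + "<urn:lubm:rdfts#GraduateStudent07> .";
        String[] triple = parse(line);
        System.out.println(triple[0] + " " + triple[1] + " " + triple[2]);
    }
}
```

The first fixture line corresponds to the `rs2` statement that `testMultipleNTriplesInputs` verifies.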
