[jira] [Updated] (GIRAPH-171) total time in MasterThread.run() is calculated incorrectly

2012-04-05 Thread Eugene Koontz (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/GIRAPH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koontz updated GIRAPH-171:
-

Attachment: GIRAPH-171.patch

> total time in MasterThread.run() is calculated incorrectly
> --
>
> Key: GIRAPH-171
> URL: https://issues.apache.org/jira/browse/GIRAPH-171
> Project: Giraph
>  Issue Type: Bug
>Reporter: Eugene Koontz
>Assignee: Eugene Koontz
> Attachments: GIRAPH-171.patch
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (GIRAPH-171) total time in MasterThread.run() is calculated incorrectly

2012-04-05 Thread Eugene Koontz (Created) (JIRA)
total time in MasterThread.run() is calculated incorrectly
--

 Key: GIRAPH-171
 URL: https://issues.apache.org/jira/browse/GIRAPH-171
 Project: Giraph
  Issue Type: Bug
Reporter: Eugene Koontz
Assignee: Eugene Koontz
 Attachments: GIRAPH-171.patch

While running PageRankBenchmark, I saw the following in the output:

{{graph.MasterThread(172): total: Took 1.3336739262910001E9 seconds.}}

This was because currently, in {{MasterThread.run()}}, we have:

{code}
LOG.info("total: Took " +
    ((System.currentTimeMillis() / 1000.0d) -
        setupSecs) + " seconds.");
{code}

but it should be:

{code}
LOG.info("total: Took " +
    ((System.currentTimeMillis() - startMillis) /
        1000.0d) + " seconds.");
{code}
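To make the unit error concrete: {{System.currentTimeMillis() / 1000.0d}} is seconds since the Unix epoch (about 1.3e9 in April 2012), so subtracting a value that is not also epoch-based yields an epoch-scale "duration". Here is a minimal, self-contained sketch of the corrected pattern (the class name and the sleep are invented for illustration; only {{startMillis}} and the log line mirror the snippet above):

```java
public class ElapsedTimeDemo {
    public static void main(String[] args) throws InterruptedException {
        // Capture the start time once, in milliseconds.
        long startMillis = System.currentTimeMillis();

        Thread.sleep(50); // stand-in for the actual computation

        // Correct: subtract in milliseconds first, then convert to seconds.
        double totalSecs = (System.currentTimeMillis() - startMillis) / 1000.0d;
        System.out.println("total: Took " + totalSecs + " seconds.");

        // Buggy form for comparison: dividing before subtracting mixes
        // epoch seconds with a non-epoch value and prints a huge number.
    }
}
```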





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247542#comment-13247542
 ] 

Paolo Castagna commented on GIRAPH-170:
---

bq. we may want to consider therefore adding backlinks

Yep. I'd like to better understand what people currently do if they need 
incoming and outgoing links for their processing.
An adjacency list can be constructed listing incoming (a.k.a. backlinks) as 
well as outgoing links, in one MapReduce job.

Input:

s1 -p1-> o1
s1 -p2-> o2
s1 -p2-> o3
s2 -p1-> s1
s2 ...

Output (adjacency list):

s1 (out: p1 o1) (out: p2 o2) (out: p2 o3) (in: s2 p1)
s2 ...

Whether it is better to do it this way, or to have support in the Giraph APIs 
that avoids an initial MapReduce job to construct the adjacency list, I do not 
know yet.
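The grouping step above can be sketched in memory (not an actual MapReduce job; the class and method names are invented for illustration). For each triple (s, p, o) it emits an outgoing entry under s and an incoming (backlink) entry under o, then groups by node:

```java
import java.util.*;

public class AdjacencySketch {
    // For each triple (s, p, o), record an outgoing edge under s and an
    // incoming (backlink) edge under o, then group entries by node.
    static Map<String, List<String>> adjacency(List<String[]> triples) {
        Map<String, List<String>> adj = new TreeMap<>();
        for (String[] t : triples) {
            String s = t[0], p = t[1], o = t[2];
            adj.computeIfAbsent(s, k -> new ArrayList<>()).add("(out: " + p + " " + o + ")");
            adj.computeIfAbsent(o, k -> new ArrayList<>()).add("(in: " + s + " " + p + ")");
        }
        return adj;
    }

    public static void main(String[] args) {
        List<String[]> input = List.of(
            new String[]{"s1", "p1", "o1"},
            new String[]{"s1", "p2", "o2"},
            new String[]{"s1", "p2", "o3"},
            new String[]{"s2", "p1", "s1"});
        adjacency(input).forEach((node, edges) ->
            System.out.println(node + " " + String.join(" ", edges)));
        // The s1 row comes out as:
        // s1 (out: p1 o1) (out: p2 o2) (out: p2 o3) (in: s2 p1)
    }
}
```

In a real MapReduce job the two {{add}} calls would be the map-side emits and the grouping would happen in the shuffle/reduce phase.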

> Workflow for loading RDF graph data into Giraph
> ---
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Dan Brickley
>Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data. 
> RDF uses sets of simple binary relationships, labeling nodes and links with 
> Web identifiers (URIs). Many public datasets are available as RDF, including 
> the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many 
> such datasets are listed at http://thedatahub.org/
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple 
> line-oriented format is N-Triples. A format aligned with RDF's SPARQL query 
> language is Turtle. Apache Jena and Any23 provide software to handle all 
> these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
> This JIRA leaves open the strategy for loading RDF data into Giraph. There 
> are various possibilities, including exploitation of intermediate 
> Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a 
> more Giraph-friendly form, or writing custom loaders. Even a HOWTO document 
> or implementor notes here would be an advance on the current state of the 
> art. The BluePrints Graph API (Gremlin etc.) has also been aligned with 
> various RDF datasources.
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 
> touches on the issue (since we can't currently easily represent fully general 
> RDF graphs since two nodes might be connected by more than one typed edge). 
> Even without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the Movies + 
> People subset of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe 
> VertexOutputFormat) would certainly [despite GIRAPH-141] still help"





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Dan Brickley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247506#comment-13247506
 ] 

Dan Brickley commented on GIRAPH-170:
-

Another architectural note around RDF:

RDF is basically simple factual data expressed as sets of binary relationships. 
In that sense it is a graph directly, already. 

However often RDF describes something that is in a deeper sense also a graph. 
Common examples include FOAF, where node and edge types (Person, Document, 
Group, etc.) can express a matrix of collaboration, social linkage, etc. Or from 
DBpedia.org, Freebase etc., we have for example datasets of movies and actors. 
In the DBpedia case, it's simple enough: a movie node, an actor node, and a 
typed link between them. Freebase, by contrast, reifies the 'starring' 
relationship into another node, so you can represent dates, character names, 
etc. This sort of meta-information (properties of links) is, by the way, also 
present in the BluePrints/Gremlin API.

One point here is that a 'starring' link pointing from a Movie to an Actor 
tells us the same thing, in reverse, as a 'starsIn' link from the Actor to the 
Movie would. For Giraph we may therefore want to consider adding backlinks, so 
each node is equally aware of the properties pointing both in and out.


> Workflow for loading RDF graph data into Giraph
> ---
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Dan Brickley
>Priority: Minor
>





Re: On helping new contributors pitch in quickly...

2012-04-05 Thread Dan Brickley
On 5 April 2012 17:05, Avery Ching  wrote:
> Dan, you're definitely right that this has been mentioned a few times.  The
> multigraph issue is one part of it, but a helper VertexInputFormat (and
> maybe VertexOutputFormat) would certainly still help as you mention.  Can
> you please open a JIRA (and help if you have time)?

Here you go: https://issues.apache.org/jira/browse/GIRAPH-170

I've tried to summarise discussion from here and elsewhere.

Dan


[jira] [Created] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Dan Brickley (Created) (JIRA)
Workflow for loading RDF graph data into Giraph
---

 Key: GIRAPH-170
 URL: https://issues.apache.org/jira/browse/GIRAPH-170
 Project: Giraph
  Issue Type: New Feature
Reporter: Dan Brickley
Priority: Minor


W3C RDF provides a family of Web standards for exchanging graph-based data. RDF 
uses sets of simple binary relationships, labeling nodes and links with Web 
identifiers (URIs). Many public datasets are available as RDF, including the 
"Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many such 
datasets are listed at http://thedatahub.org/

RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple 
line-oriented format is N-Triples. A format aligned with RDF's SPARQL query 
language is Turtle. Apache Jena and Any23 provide software to handle all these; 
http://incubator.apache.org/jena/ http://incubator.apache.org/any23/

This JIRA leaves open the strategy for loading RDF data into Giraph. There are 
various possibilities, including exploitation of intermediate Hadoop-friendly 
stores, or pre-processing with e.g. Pig-based tools into a more Giraph-friendly 
form, or writing custom loaders. Even a HOWTO document or implementor notes 
here would be an advance on the current state of the art. The BluePrints Graph 
API (Gremlin etc.) has also been aligned with various RDF datasources.

Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 
touches on the issue (since we can't currently easily represent fully general 
RDF graphs since two nodes might be connected by more than one typed edge). 
Even without multigraphs it ought to be possible to bring RDF-sourced data
into Giraph, e.g. perhaps some app is only interested in say the Movies + 
People subset of a big RDF collection.

From Avery in email: "a helper VertexInputFormat (and maybe 
VertexOutputFormat) would certainly [despite GIRAPH-141] still help"







[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Dan Brickley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247490#comment-13247490
 ] 

Dan Brickley commented on GIRAPH-170:
-

From Paolo in email:

""" I suspect N-Triples | N-Quads might not be the best option for
something like Giraph. Something more like an adjacency list might be
better.

So, my intuition, is that if you start with RDF in N-Triples format,
the first step would be a simple MapReduce job to group RDF statements
by subject (eventually filtering out certain properties):

Input:

 s1 --p1--> o1
 s1 --p2--> o2
 s1 --p2--> o3
 s2 ...

Output (adjacency list):

 s1 (p1 o1) (p2 o2) (p2 o3)
 s2 ..."""

> Workflow for loading RDF graph data into Giraph
> ---
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Dan Brickley
>Priority: Minor
>





[jira] [Updated] (GIRAPH-168) Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than HADOOP_FACEBOOK) and remove usage of HADOOP

2012-04-05 Thread Eugene Koontz (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/GIRAPH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koontz updated GIRAPH-168:
-

Attachment: GIRAPH-168.patch

- Update hadoop_trunk to look for Hadoop version 3.0.0-SNAPSHOT
- Change the HADOOP_OLDRPC munge flag to the more descriptive HADOOP_NON_SASL_RPC

> Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than 
> HADOOP_FACEBOOK) and remove usage of HADOOP
> -
>
> Key: GIRAPH-168
> URL: https://issues.apache.org/jira/browse/GIRAPH-168
> Project: Giraph
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Eugene Koontz
>Assignee: Eugene Koontz
> Attachments: GIRAPH-168.patch, GIRAPH-168.patch, GIRAPH-168.patch, 
> GIRAPH-168.patch
>
>
> This JIRA relates to the mail thread here: 
> http://mail-archives.apache.org/mod_mbox/incubator-giraph-dev/201203.mbox/browser
> Currently we check for the munge flags HADOOP, HADOOP_FACEBOOK and 
> HADOOP_NON_SECURE when using munge in a few places. Hopefully we can 
> eliminate usage of munge in the future, but until then, we can mitigate the 
> complexity by consolidating the number of flags checked. This JIRA renames 
> HADOOP_FACEBOOK to HADOOP_SECURE, and removes usages of HADOOP, to handle the 
> same conditional compilation requirements. It also makes it easier to add 
> more maven profiles so that we can easily increase our hadoop version 
> coverage.
> This patch modifies the existing hadoop_facebook profile to use the new 
> HADOOP_SECURE munge flag, rather than HADOOP_FACEBOOK.
> It also adds a new hadoop maven profile, hadoop_trunk, which also sets 
> HADOOP_SECURE. 
> Finally, it adds a default profile, hadoop_0.20.203. This is needed so that 
> we can specify its dependencies separately from hadoop_trunk, because the 
> hadoop dependencies have changed between trunk and 0.205.0 - the former 
> requires hadoop-common, hadoop-mapreduce-client-core, and 
> hadoop-mapreduce-client-common, whereas the latter requires hadoop-core. 
> With this patch, the following passes:
> {code}
> mvn clean verify && mvn -Phadoop_trunk clean verify && mvn -Phadoop_0.20.203 
> clean verify
> {code}
> Current problems: 
> * I left in place the usage of HADOOP_NON_SECURE, but note that the profile 
> that uses this is hadoop_non_secure, which fails to compile on trunk: 
> https://issues.apache.org/jira/browse/GIRAPH-167 .
> * I couldn't get -Phadoop_facebook to work; does this work outside of 
> Facebook?
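For context, munge is a comment-based preprocessor, so a flag such as HADOOP_SECURE guards blocks of Java source roughly as follows (an illustrative sketch only; the flag name comes from this issue, but check the build for munge's exact directive syntax):

```java
/*if[HADOOP_SECURE]
    // ... code compiled only when the active maven profile defines
    // HADOOP_SECURE (e.g. security-aware RPC setup) ...
else[HADOOP_SECURE]*/
    // ... code compiled for non-secure Hadoop versions ...
/*end[HADOOP_SECURE]*/
```

Fewer flags therefore means fewer of these conditional blocks to keep consistent across profiles.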





[jira] [Commented] (GIRAPH-168) Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than HADOOP_FACEBOOK) and remove usage of HADOOP

2012-04-05 Thread Eugene Koontz (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247396#comment-13247396
 ] 

Eugene Koontz commented on GIRAPH-168:
--

Hi Jakob, I wonder if HADOOP_NO_SASL might be better than HADOOP_OLDRPC (since 
the divergence in RPC has to do with HADOOP-6419 ("Change RPC layer to support 
SASL based mutual authentication"))?



> Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than 
> HADOOP_FACEBOOK) and remove usage of HADOOP
> -
>
> Key: GIRAPH-168
> URL: https://issues.apache.org/jira/browse/GIRAPH-168
> Project: Giraph
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Eugene Koontz
>Assignee: Eugene Koontz
> Attachments: GIRAPH-168.patch, GIRAPH-168.patch, GIRAPH-168.patch
>
>





Re: On helping new contributors pitch in quickly...

2012-04-05 Thread Avery Ching

Here is a related JIRA https://issues.apache.org/jira/browse/GIRAPH-155

Avery

On 4/5/12 9:45 AM, Paolo Castagna wrote:

Hi Dan,
I have not an answer to your questions/observations yet.

However, I suspect N-Triples | N-Quads might not be the best option for
something like Giraph. Something more like an adjacency list might be
better.

So, my intuition, is that if you start with RDF in N-Triples format,
the first step would be a simple MapReduce job to group RDF statements
by subject (eventually filtering out certain properties):

Input:

   s1 --p1-->  o1
   s1 --p2-->  o2
   s1 --p2-->  o3
   s2 ...

Output (adjacency list):

   s1 (p1 o1) (p2 o2) (p2 o3)
   s2 ...

But, as I said, is it too early for me to say definitely this is the
best approach.

Paolo

Dan Brickley wrote:

On 5 April 2012 05:49, Jakob Homan  wrote:

Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
them, which itself is a good thing.  I'll get some new ones added
first thing in the morning.  Sorry.

Do we have something around "document a workflow to get RDF graph data
into Giraph?". A few of us have been talking about it here or there,
and I've heard various strategies mentioned (e.g. Ntriples as it's a
simple line-oriented format; piggybacking on HBase or other storage
that Giraph already has adaptors for; integrating Apache Jena; ...). I
can't find much in JIRA but
https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
(since we can't currently easily represent fully general RDF graphs
since two nodes might be connected by more than one typed edge). Even
without multigraphs it ought to be possible to bring RDF-sourced data
into Giraph, e.g. perhaps some app is only interested in say the
Movies + People subset of a big RDF collection. And so perhaps most of
the work is in preprocessing for now - e.g. via Ntriples + Pig; but
still it would be great to have a clear HOWTO.

As an interested party on the periphery, a JIRA for this would give a
natural place to monitor, read up, maybe even help. And I'm sure I'm
not alone...

cheers,

Dan




Re: On helping new contributors pitch in quickly...

2012-04-05 Thread Paolo Castagna
Hi Dan,
I do not have an answer to your questions/observations yet.

However, I suspect N-Triples | N-Quads might not be the best option for
something like Giraph. Something more like an adjacency list might be
better.

So my intuition is that if you start with RDF in N-Triples format,
the first step would be a simple MapReduce job to group RDF statements
by subject (possibly filtering out certain properties):

Input:

  s1 --p1--> o1
  s1 --p2--> o2
  s1 --p2--> o3
  s2 ...

Output (adjacency list):

  s1 (p1 o1) (p2 o2) (p2 o3)
  s2 ...

But, as I said, it is too early for me to say definitively that this is
the best approach.

Paolo
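
The group-by-subject step can be sketched in memory rather than as a MapReduce job (class and method names are invented for illustration; the parser handles only the simple URI-only N-Triples case, not literals containing spaces):

```java
import java.util.*;

public class GroupBySubject {
    // Group simple N-Triples lines ("<s> <p> <o> .") by subject -- the
    // "map then group" step described above, done in memory.
    static Map<String, List<String>> group(List<String> ntriples) {
        Map<String, List<String>> bySubject = new LinkedHashMap<>();
        for (String line : ntriples) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 4) continue; // skip malformed lines
            String s = parts[0], p = parts[1], o = parts[2];
            bySubject.computeIfAbsent(s, k -> new ArrayList<>())
                     .add("(" + p + " " + o + ")");
        }
        return bySubject;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "<s1> <p1> <o1> .",
            "<s1> <p2> <o2> .",
            "<s1> <p2> <o3> .");
        group(input).forEach((s, pos) ->
            System.out.println(s + " " + String.join(" ", pos)));
        // prints: <s1> (<p1> <o1>) (<p2> <o2>) (<p2> <o3>)
    }
}
```

In the MapReduce formulation, the map phase would emit (subject, "(p o)") pairs and the reduce phase would concatenate them per subject.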

Dan Brickley wrote:
> On 5 April 2012 05:49, Jakob Homan  wrote:
>> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
>> them, which itself is a good thing.  I'll get some new ones added
>> first thing in the morning.  Sorry.
> 
> Do we have something around "document a workflow to get RDF graph data
> into Giraph?". A few of us have been talking about it here or there,
> and I've heard various strategies mentioned (e.g. Ntriples as it's a
> simple line-oriented format; piggybacking on HBase or other storage
> that Giraph already has adaptors for; integrating Apache Jena; ...). I
> can't find much in JIRA but
> https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
> (since we can't currently easily represent fully general RDF graphs
> since two nodes might be connected by more than one typed edge). Even
> without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the
> Movies + People subset of a big RDF collection. And so perhaps most of
> the work is in preprocessing for now - e.g. via Ntriples + Pig; but
> still it would be great to have a clear HOWTO.
> 
> As an interested party on the periphery, a JIRA for this would give a
> natural place to monitor, read up, maybe even help. And I'm sure I'm
> not alone...
> 
> cheers,
> 
> Dan


Re: On helping new contributors pitch in quickly...

2012-04-05 Thread Avery Ching
Dan, you're definitely right that this has been mentioned a few times.  
The multigraph issue is one part of it, but a helper VertexInputFormat 
(and maybe VertexOutputFormat) would certainly still help as you 
mention.  Can you please open a JIRA (and help if you have time)?


Avery

On 4/5/12 1:49 AM, Dan Brickley wrote:

On 5 April 2012 05:49, Jakob Homan  wrote:

Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
them, which itself is a good thing.  I'll get some new ones added
first thing in the morning.  Sorry.

Do we have something around "document a workflow to get RDF graph data
into Giraph?". A few of us have been talking about it here or there,
and I've heard various strategies mentioned (e.g. Ntriples as it's a
simple line-oriented format; piggybacking on HBase or other storage
that Giraph already has adaptors for; integrating Apache Jena; ...). I
can't find much in JIRA but
https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
(since we can't currently easily represent fully general RDF graphs
since two nodes might be connected by more than one typed edge). Even
without multigraphs it ought to be possible to bring RDF-sourced data
into Giraph, e.g. perhaps some app is only interested in say the
Movies + People subset of a big RDF collection. And so perhaps most of
the work is in preprocessing for now - e.g. via Ntriples + Pig; but
still it would be great to have a clear HOWTO.

As an interested party on the periphery, a JIRA for this would give a
natural place to monitor, read up, maybe even help. And I'm sure I'm
not alone...

cheers,

Dan




Re: On helping new contributors pitch in quickly...

2012-04-05 Thread Dan Brickley
On 5 April 2012 05:49, Jakob Homan  wrote:
> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
> them, which itself is a good thing.  I'll get some new ones added
> first thing in the morning.  Sorry.

Do we have something around "document a workflow to get RDF graph data
into Giraph?". A few of us have been talking about it here or there,
and I've heard various strategies mentioned (e.g. Ntriples as it's a
simple line-oriented format; piggybacking on HBase or other storage
that Giraph already has adaptors for; integrating Apache Jena; ...). I
can't find much in JIRA but
https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
(since we can't currently easily represent fully general RDF graphs
since two nodes might be connected by more than one typed edge). Even
without multigraphs it ought to be possible to bring RDF-sourced data
into Giraph, e.g. perhaps some app is only interested in say the
Movies + People subset of a big RDF collection. And so perhaps most of
the work is in preprocessing for now - e.g. via Ntriples + Pig; but
still it would be great to have a clear HOWTO.

As an interested party on the periphery, a JIRA for this would give a
natural place to monitor, read up, maybe even help. And I'm sure I'm
not alone...

cheers,

Dan


Re: Giraph as Whirr service, see WHIRR-530

2012-04-05 Thread Paolo Castagna
Thank you all for your comments.

There seems to be some interest, and certainly agreement on the just
"for testing"/"temporary" point and on the limits of cloud infrastructure
in relation to things such as Hadoop, ZooKeeper and Giraph.

I also agree that, given Whirr can already spin up Hadoop clusters,
users can run Giraph that way.

The Whirr option might become more interesting in relation to YARN
and perhaps unit/integration testing (although I am not sure who would
be willing to put a credit card behind that). Fortunately, Giraph
tests run reasonably well and quickly locally.

Anyway, I'll keep an eye on WHIRR-530 and, as I learn more about
Giraph and Whirr, help with that if I can. Personally, I am more
interested in YARN and Giraph than in Giraph in its current shape.
Or, in other words, in the future of Giraph rather than in the
past (i.e. backward compatibility/legacy) (although I am aware
you have that in mind as well, and it seems to me there are
already Giraph users, so...)

Thanks,
Paolo

Paolo Castagna wrote:
> Hi,
> seen this?
> 
>   WHIRR-530 - Add Giraph as a service
>   https://issues.apache.org/jira/browse/WHIRR-530
> 
> This could be quite useful for users who want to give Giraph a spin on cloud
> infrastructure, just for testing or to run a few small experiments.
> My experience with Whirr on small 10-20 node clusters has been quite positive.
> Less so for larger clusters, but that is more a problem/limit with the cloud
> provider than with Whirr itself, I think.
> 
> Whirr makes it extremely easy and pleasant to deploy stuff on demand.
> 
> ... and Whirr already supports YARN:
> https://issues.apache.org/jira/browse/WHIRR-391
> 
> Is any Giraph developers/users here also a Whirr user?
> 
> Paolo


Re: Giraph as Whirr service, see WHIRR-530

2012-04-05 Thread Dan McClary
Having used Whirr several times in EC2, it seems like a fine way to spin up
a temporary 'developers' cluster.  Zookeeper is the most likely source of
difficulty on VMs with limited I/O (i.e., it's very chatty and doesn't
tolerate the highly variable latency that smaller AMIs provide).  The HBase
community seems to be very aware of this; there are likely some tips and
tricks to be gleaned from reading their mailing lists.

-Dan

On Wed, Apr 4, 2012 at 11:08 PM, Brian Femiano  wrote:

> I've used it on clusters I started on EC2 launched by Whirr. Simply copy
> the fat
> jar to your client machine and it will distribute normally as an M/R
> dependency.
>
> It works very well.
>
> The only limitation I could potentially find (without much proof) was that
> on VMs with limited IO, the RPC message overhead between workers could be
> an issue.
> I never tried it on VMs with less than 'High' IO, so take that with a grain
> of salt.
>
> On Thu, Apr 5, 2012 at 12:51 AM, Jakob Homan  wrote:
>
> > This is interesting.  Whirr can already spin up Hadoop MR clusters,
> > which can then run the Giraph jobs.  Once Giraph is bootstrapped onto
> > YARN, this will make more sense as a Whirr service.
> >
> > On Wed, Apr 4, 2012 at 9:43 PM, Avery Ching  wrote:
> > > I don't use Whirr...I haven't heard it mentioned on this forum yet.
> >  Anyone?
> > >
> > > Avery
> > >
> > >
> > > On 4/4/12 9:30 PM, Paolo Castagna wrote:
> > >>
> > >> Hi,
> > >> seen this?
> > >>
> > >>   WHIRR-530 - Add Giraph as a service
> > >>   https://issues.apache.org/jira/browse/WHIRR-530
> > >>
> > >> This could be quite useful for users who want to give Giraph a spin on
> > >> cloud
> > >> infrastructure, just for testing or to run a few small experiments.
> > >> My experience with Whirr on small 10-20 node clusters has been quite
> > >> positive. Less so for larger clusters, but that is more a problem/limit
> > >> with the cloud provider than with Whirr itself, I think.
> > >>
> > >> Whirr makes it extremely easy and pleasant to deploy stuff on demand.
> > >>
> > >> ... and Whirr already supports YARN:
> > >> https://issues.apache.org/jira/browse/WHIRR-391
> > >>
> > >> Is any Giraph developers/users here also a Whirr user?
> > >>
> > >> Paolo
> > >
> > >
> >
>



-- 
Daniel McClary, Ph.D.
Visiting Scholar
*Amaral Lab and Department of Chemical and Biological Engineering,
Northwestern University*
Bioinformatics Specialist II
*Howard Hughes Medical Institute*
Email: dan.mccl...@northwestern.edu
Phone: (847) 491-1234
Web: http://amaral-lab.org/people/mcclary/
Mailing address:
2145 Sheridan Rd, Room E-136
Northwestern University
Evanston, IL 60208