RE: Loading tab spaced data

Meier, Caleb Tue, 29 Aug 2017 07:54:26 -0700

Hey Matteo, 

Do you know offhand how many triples are included in your dataset?  Also, can 
you send a stack trace?  How are you converting your Parquet file to one of the 
formats supported by the RdfIngestTool (n-triples, trig, ...)?


Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
[email protected] ♦ www.parsons.com

-----Original Message-----
From: Matteo Cossu [mailto:[email protected]] 
Sent: Tuesday, August 29, 2017 10:36 AM
To: [email protected]
Subject: Re: Loading tab spaced data

Hello Caleb,
I was trying to load a 53GB file (in Parquet format) using 10 containers with 
15GB of memory assigned to each.
Does anyone have reference numbers, such as how big a dataset can be with 
these resources?
This would help me tell when the problem is entirely on my side, which is 
likely, since I'm still a novice with many of the tools I'm using (Accumulo, 
for example).

Thank you all for the answers,
Matteo Cossu


On 29 August 2017 at 16:12, Meier, Caleb <[email protected]> wrote:

> Hello Matteo,
>
> Were you using the MapReduce ingest tool when you were running out of 
> memory?  If so, do you know how big the file was that you were ingesting, 
> how many containers Yarn allocated to your job, and how much memory 
> was allocated to each container?
>
> -----Original Message-----
> From: Matteo Cossu [mailto:[email protected]]
> Sent: Monday, August 28, 2017 8:04 PM
> To: [email protected]
> Subject: Re: Loading tab spaced data
>
> I would like to help, but I still can't even test Rya properly. I'm 
> developing a similar system for research (using Spark SQL), and I 
> wanted to compare my software's performance with Rya's on the university cluster.
> When I try to use these Rya tools to load the big datasets, they 
> always crash (mostly out-of-memory problems) and never complete 
> the loading. At the moment, I urgently need to publish some results, 
> so I am comparing my software with other systems.
> Later, I could come back to Rya and try to solve some bugs along the way 
> :P
>
> Best Regards,
> Matteo Cossu
>
> On 29 August 2017 at 01:29, Josh Elser <[email protected]> wrote:
>
> > Hi Matteo,
> >
> > Thanks for the bug-report. Do you have an interest in making the 
> > change to Rya to address this issue? :)
> >
> > In open source projects, we like to encourage users to make changes 
> > to "scratch their own itch". Please let us know how we can help 
> > enable you to make this change.
> >
> > On 8/25/17 8:45 AM, Matteo Cossu wrote:
> >
> >> Hello,
> >> I have some problems in loading the data with the Map Reduce code 
> >> provided.
> >> I am using this class:
> >> *org.apache.rya.accumulo.mr.tools.RdfFileInputTool*.
> >> When my input data is in N-Triples format and the triples are tab 
> >> separated instead of space separated, I get this error:
> >>
> >> *org.openrdf.rio.RDFParseException: Expected '<', found: m*
> >>
> >> I solved it by substituting all the tabs with spaces in my input data, 
> >> but since tabs are a valid separator in the N-Triples format, I 
> >> think this should be implemented (or fixed) directly within the tool.
> >>
> >> Kind Regards,
> >> Matteo Cossu
> >>
> >>
>
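
For anyone hitting the same parse error, the workaround Matteo describes (substituting tabs with spaces before ingest) could be sketched as a small preprocessing step. This is only an illustration, not part of Rya: the function name `tabs_to_spaces` is hypothetical, and it assumes one triple per line. Unlike a blanket search-and-replace, it leaves any tab inside a quoted literal untouched, so literal values are not changed.

```python
def tabs_to_spaces(line: str) -> str:
    """Replace tabs outside quoted literals with single spaces.

    Hypothetical helper for preprocessing N-Triples lines; assumes
    one triple per line.
    """
    out = []
    in_literal = False  # True while inside a double-quoted literal
    escaped = False     # True right after a backslash inside a literal
    for ch in line:
        if in_literal:
            out.append(ch)
            if escaped:
                escaped = False      # this character was escaped; skip checks
            elif ch == "\\":
                escaped = True       # next character is part of an escape
            elif ch == '"':
                in_literal = False   # unescaped quote closes the literal
        elif ch == '"':
            in_literal = True
            out.append(ch)
        elif ch == "\t":
            out.append(" ")          # separator tab becomes a space
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = '<http://example.org/s>\t<http://example.org/p>\t"a\tb" .'
    print(tabs_to_spaces(sample))
```

Applied line by line while streaming a large file, this avoids loading the whole dataset into memory before handing it to the ingest tool.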
