1 billion triples. I don't have a stack trace, but I'll try to get one next week. Now I'm worried that the problem was caused by some mistake of mine and that I'm stealing your time :) Anyway, I don't convert anything from Parquet because that's managed by Hadoop, and the file is already in N-Triples format.
On 29 August 2017 at 16:53, Meier, Caleb <[email protected]> wrote:

> Hey Matteo,
>
> Do you know offhand how many triples are included in your dataset? Also,
> can you send a stack trace? How are you converting your Parquet file to
> one of the formats supported by the RdfIngestTool (n-triples, trig, ...)?
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office: (703)797-3066
> [email protected] ♦ www.parsons.com
>
> -----Original Message-----
> From: Matteo Cossu [mailto:[email protected]]
> Sent: Tuesday, August 29, 2017 10:36 AM
> To: [email protected]
> Subject: Re: Loading tab spaced data
>
> Hello Caleb,
> I was trying to load a 53GB file (in Parquet format) with 10 containers,
> each assigned 15GB of memory.
> Does anyone have some reference numbers, i.e. how big a dataset can be
> loaded with these resources? That would help me tell when the problem is
> entirely mine, which is probable, since I'm still a novice with many of
> the tools I'm using (Accumulo, for example).
>
> Thank you all for the answers,
> Matteo Cossu
>
>
> On 29 August 2017 at 16:12, Meier, Caleb <[email protected]> wrote:
>
> > Hello Matteo,
> >
> > Were you using the MapReduce ingest tool when you were running out of
> > memory? If so, do you know how big the file was that you were ingesting,
> > how many containers Yarn allocated to your job, and how much memory
> > was allocated to each container?
> >
> > -----Original Message-----
> > From: Matteo Cossu [mailto:[email protected]]
> > Sent: Monday, August 28, 2017 8:04 PM
> > To: [email protected]
> > Subject: Re: Loading tab spaced data
> >
> > I would like to help, but I still can't even test Rya properly. For my
> > research I'm developing a similar system (using Spark SQL), and I wanted
> > to compare my software's performance with Rya on the university cluster.
> > When I try to use the Rya tools to load the big datasets, the job always
> > crashes (mostly with out-of-memory problems) and never completes the
> > loading. At the moment I urgently need to publish some results, so I am
> > comparing my software with other systems. Later, I could go back to Rya
> > and try to fix some bugs along the way :P
> >
> > Best Regards,
> > Matteo Cossu
> >
> > On 29 August 2017 at 01:29, Josh Elser <[email protected]> wrote:
> >
> > > Hi Matteo,
> > >
> > > Thanks for the bug report. Do you have an interest in making the
> > > change to Rya to address this issue? :)
> > >
> > > In open source projects, we like to encourage users to make changes
> > > to "scratch their own itch". Please let us know how we can help
> > > enable you to make this change.
> > >
> > > On 8/25/17 8:45 AM, Matteo Cossu wrote:
> > >
> > >> Hello,
> > >> I have some problems loading data with the MapReduce code provided.
> > >> I am using this class:
> > >> org.apache.rya.accumulo.mr.tools.RdfFileInputTool
> > >> When my input data is in N-Triples format and the triples are
> > >> tab-separated instead of space-separated, I get this error:
> > >>
> > >> org.openrdf.rio.RDFParseException: Expected '<', found: m
> > >>
> > >> I worked around it by substituting all the tabs with spaces in my
> > >> input data, but since tabs are a valid separator in the N-Triples
> > >> format, I think this should be handled (or fixed) directly within
> > >> the tool.
> > >>
> > >> Kind Regards,
> > >> Matteo Cossu
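The tab-to-space workaround described in the bug report above can be sketched as a small pre-processing step. This is a minimal, hypothetical helper (not part of Rya or the RdfFileInputTool); it assumes raw tabs might also appear inside quoted literals, which a blind search-and-replace would corrupt, so it only rewrites tabs outside double-quoted strings:

```python
def normalize_ntriples_line(line: str) -> str:
    """Replace tab separators between N-Triples terms with single spaces.

    Tabs inside quoted literals are preserved; we track whether we are
    inside a double-quoted literal (respecting backslash escapes) and
    only rewrite tabs that appear outside one.
    """
    out = []
    in_literal = False
    escaped = False
    for ch in line:
        if in_literal:
            if escaped:
                escaped = False       # this char was escaped, e.g. \" or \t
            elif ch == "\\":
                escaped = True        # next char is escaped
            elif ch == '"':
                in_literal = False    # closing quote ends the literal
            out.append(ch)
        else:
            if ch == '"':
                in_literal = True     # opening quote starts a literal
                out.append(ch)
            elif ch == "\t":
                out.append(" ")       # term separator: tab -> space
            else:
                out.append(ch)
    return "".join(out)
```

Applied line by line over the input file before ingest (for example in a small streaming filter), this reproduces the substitution Matteo describes without touching tabs embedded in literal values.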
