1 billion triples. I don't have a stack trace, but I'll try to get one next week. Now I'm worried that the problem was caused by some mistake of mine and that I'm stealing your time :) Anyway, I don't convert anything from Parquet because that's managed by Hadoop, and the file is already in N-Triples format.
On 29 August 2017 at 16:53, Meier, Caleb <[email protected]> wrote:

> Hey Matteo,
>
> Do you know offhand how many triples are included in your dataset? Also,
> can you send a stack trace? How are you converting your Parquet file to
> one of the formats supported by the RdfIngestTool (n-triples, trig, ...)?
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office: (703)797-3066
> [email protected] ♦ www.parsons.com
>
> -----Original Message-----
> From: Matteo Cossu [mailto:[email protected]]
> Sent: Tuesday, August 29, 2017 10:36 AM
> To: [email protected]
> Subject: Re: Loading tab spaced data
>
> Hello Caleb,
> I was trying to load a 53GB file (in Parquet format) with 10 containers,
> each assigned 15GB of memory.
> Does anyone have some reference numbers, i.e. how big a dataset can be
> loaded with these resources? That would help me tell when the problem is
> entirely mine, which is probable, since I'm still a novice with many of
> the tools I'm using (Accumulo, for example).
>
> Thank you all for the answers,
> Matteo Cossu
>
>
> On 29 August 2017 at 16:12, Meier, Caleb <[email protected]> wrote:
>
> > Hello Matteo,
> >
> > Were you using the MapReduce ingest tool when you were running out of
> > memory? If so, do you know how big the file was that you were ingesting,
> > how many containers Yarn allocated to your job, and how much memory
> > was allocated to each container?
> >
> > -----Original Message-----
> > From: Matteo Cossu [mailto:[email protected]]
> > Sent: Monday, August 28, 2017 8:04 PM
> > To: [email protected]
> > Subject: Re: Loading tab spaced data
> >
> > I would like to help, but I still can't even test Rya properly. For my
> > research I'm developing a similar system (using Spark SQL), and I wanted
> > to compare my software's performance with Rya on the university cluster.
> > When I try to use the Rya tools to load the big datasets, the job always
> > crashes (mostly with out-of-memory problems) and never completes the
> > loading. At the moment I urgently need to publish some results, so I am
> > comparing my software with other systems. Later, I could go back to Rya
> > and try to fix some bugs along the way :P
> >
> > Best Regards,
> > Matteo Cossu
> >
> > On 29 August 2017 at 01:29, Josh Elser <[email protected]> wrote:
> >
> > > Hi Matteo,
> > >
> > > Thanks for the bug report. Do you have an interest in making the
> > > change to Rya to address this issue? :)
> > >
> > > In open source projects, we like to encourage users to make changes
> > > to "scratch their own itch". Please let us know how we can help
> > > enable you to make this change.
> > >
> > > On 8/25/17 8:45 AM, Matteo Cossu wrote:
> > >
> > >> Hello,
> > >> I have some problems loading data with the MapReduce code provided.
> > >> I am using this class:
> > >> org.apache.rya.accumulo.mr.tools.RdfFileInputTool
> > >> When my input data is in N-Triples format and the triples are
> > >> tab-separated instead of space-separated, I get this error:
> > >>
> > >> org.openrdf.rio.RDFParseException: Expected '<', found: m
> > >>
> > >> I worked around it by substituting all the tabs with spaces in my
> > >> input data, but since tabs are a valid separator in the N-Triples
> > >> format, I think this should be handled (or fixed) directly within
> > >> the tool.
> > >>
> > >> Kind Regards,
> > >> Matteo Cossu
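The tab-to-space workaround described in the bug report above can be sketched as a small pre-processing step. This is a minimal, hypothetical helper (not part of Rya or the RdfFileInputTool); it assumes raw tabs might also appear inside quoted literals, which a blind search-and-replace would corrupt, so it only rewrites tabs outside double-quoted strings:

```python
def normalize_ntriples_line(line: str) -> str:
    """Replace tab separators between N-Triples terms with single spaces.

    Tabs inside quoted literals are preserved; we track whether we are
    inside a double-quoted literal (respecting backslash escapes) and
    only rewrite tabs that appear outside one.
    """
    out = []
    in_literal = False
    escaped = False
    for ch in line:
        if in_literal:
            if escaped:
                escaped = False       # this char was escaped, e.g. \" or \t
            elif ch == "\\":
                escaped = True        # next char is escaped
            elif ch == '"':
                in_literal = False    # closing quote ends the literal
            out.append(ch)
        else:
            if ch == '"':
                in_literal = True     # opening quote starts a literal
                out.append(ch)
            elif ch == "\t":
                out.append(" ")       # term separator: tab -> space
            else:
                out.append(ch)
    return "".join(out)
```

Applied line by line over the input file before ingest (for example in a small streaming filter), this reproduces the substitution Matteo describes without touching tabs embedded in literal values.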
