Hello,
I fixed the problem with my loading: I had not given enough memory to the
Accumulo servers!
Since I still have time, I would like to use Rya for comparison in my
research.
I read that I should use the Prospects Table; do you have any other
suggestions on how to get the best query performance?
The results could be interesting for Rya. I am using the WatDiv
<http://dsg.uwaterloo.ca/watdiv/> test suite, which contains several types
of queries, so it will be easier to identify where Rya performs better and
where (or if) there is room for improvement.

Best Regards,
Matteo Cossu

On 29 August 2017 at 17:12, Matteo Cossu <[email protected]> wrote:

> 1 billion triples. I don't have a stack trace, but I'll try to get one
> next week.
> Now I'm worried that the problem was caused by some mistake of mine and
> I'm stealing your time :)
> Anyway, I don't convert anything from Parquet because it's managed by
> Hadoop, and the file is in N-Triples format already.
>
>
> On 29 August 2017 at 16:53, Meier, Caleb <[email protected]> wrote:
>
>> Hey Matteo,
>>
>> Do you know offhand how many triples are included in your dataset?  Also,
>> can you send a stack trace?  How are you converting your Parquet file to
>> one of the formats supported by the RdfIngestTool (n-triples, trig, ...)?
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> [email protected] ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Matteo Cossu [mailto:[email protected]]
>> Sent: Tuesday, August 29, 2017 10:36 AM
>> To: [email protected]
>> Subject: Re: Loading tab spaced data
>>
>> Hello Caleb,
>> I was trying to load a 53GB file (in Parquet format) with 10 containers,
>> each assigned 15GB of memory.
>> Does anyone have some reference numbers, like how big a dataset can be
>> with these resources?
>> This would help me know when the problem is entirely mine, which is
>> probable since I'm still a novice with many of the tools I'm using
>> (Accumulo, for example).
>>
>> Thank you all for the answers,
>> Matteo Cossu
>>
>>
>> On 29 August 2017 at 16:12, Meier, Caleb <[email protected]> wrote:
>>
>> > Hello Matteo,
>> >
>> > Were you using the MapReduce ingest tool when you ran out of
>> > memory?  If so, do you know how big the file was that you were
>> > ingesting, how many containers YARN allocated to your job, and how
>> > much memory was allocated to each container?
>> >
>> > Caleb A. Meier, Ph.D.
>> > Senior Software Engineer ♦ Analyst
>> > Parsons Corporation
>> > 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> > Office:  (703)797-3066
>> > [email protected] ♦ www.parsons.com
>> >
>> > -----Original Message-----
>> > From: Matteo Cossu [mailto:[email protected]]
>> > Sent: Monday, August 28, 2017 8:04 PM
>> > To: [email protected]
>> > Subject: Re: Loading tab spaced data
>> >
>> > I would like to help, but I still can't even test Rya properly. For my
>> > research I'm developing a similar system (using Spark SQL), and I
>> > wanted to compare my software's performance with Rya's on the
>> > university cluster. When I try to use the Rya tools to load the big
>> > datasets, the loading always crashes (mostly with out-of-memory
>> > problems) and never completes. At the moment I urgently need to
>> > publish some results, so I am comparing my software with other
>> > systems. Later, I could come back to Rya and try to fix some bugs
>> > along the way :P
>> >
>> > Best Regards,
>> > Matteo Cossu
>> >
>> > On 29 August 2017 at 01:29, Josh Elser <[email protected]> wrote:
>> >
>> > > Hi Matteo,
>> > >
>> > > Thanks for the bug report. Do you have an interest in making the
>> > > change to Rya to address this issue? :)
>> > >
>> > > In open source projects, we like to encourage users to make changes
>> > > to "scratch their own itch". Please let us know how we can help
>> > > enable you to make this change.
>> > >
>> > > On 8/25/17 8:45 AM, Matteo Cossu wrote:
>> > >
>> > >> Hello,
>> > >> I have some problems loading the data with the provided MapReduce
>> > >> code. I am using this class:
>> > >> *org.apache.rya.accumulo.mr.tools.RdfFileInputTool*.
>> > >> When my input data is in N-Triples format and the terms are
>> > >> separated by tabs instead of spaces, I get this error:
>> > >>
>> > >> *org.openrdf.rio.RDFParseException: Expected '<', found: m*
>> > >>
>> > >> I solved it by substituting all the tabs with spaces in my input
>> > >> data, but since tabs are a valid separator in the N-Triples format,
>> > >> I think this should be handled (or fixed) directly in the tool.
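[For reference, the tab-to-space substitution described above can be done with a one-line stream edit. The file name and sample triple below are placeholders, and note that this naive substitution would also rewrite tabs appearing inside quoted literal values:]

```shell
# Create a small tab-separated N-Triples sample (placeholder data).
printf '<http://ex/s>\t<http://ex/p>\t<http://ex/o> .\n' > sample.nt
# Replace every tab with a space, editing the file in place (GNU sed).
# Caveat: this also rewrites tabs inside quoted literals, so check
# your data first.
sed -i 's/\t/ /g' sample.nt
cat sample.nt
```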
>> > >>
>> > >> Kind Regards,
>> > >> Matteo Cossu
>> > >>
>> > >>
>> >
>>
>
>
