Thanks! I will try that out.
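A minimal sketch of the change I have in mind, assuming the setter on AccumuloRdfConfiguration is named setFlush (I have not verified the exact method name, so check the class's javadocs/source before relying on it):

    // Sketch: disable flushing the MultiTableBatchWriter after every
    // triple, per the suggestion below. setFlush(false) is an assumption;
    // verify the actual setter name on AccumuloRdfConfiguration.
    val conf = new AccumuloRdfConfiguration()
    conf.setTablePrefix("rya_")
    conf.setFlush(false) // let the BatchWriter buffer mutations instead of flushing per insert
    dao.setConf(conf)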
> On 13 Feb 2018, at 15:10, Puja Valiyil <[email protected]> wrote:
>
> Yes. There is a config parameter on the AccumuloRdfConfiguration that
> specifies whether or not to flush after every insert. If you set this to
> "false", the ingest times should improve. If it is set to "true", the
> flush method on the Accumulo MultiTableBatchWriter is called after each
> triple is inserted.
>
> On Tue, Feb 13, 2018 at 9:04 AM, Jorge Machado <[email protected]> wrote:
>
>> From the Accumulo BatchWriters?
>>
>> Jorge Machado
>> www.jmachado.me
>>
>>> On 13 Feb 2018, at 13:13, Puja Valiyil <[email protected]> wrote:
>>>
>>> Hey Jorge,
>>> There is a config value for flushing on insert; make sure that is set
>>> to false. If it is set to true, the writer to Accumulo will flush after
>>> every triple, which slows performance down.
>>> Hope this helps!
>>> Thanks,
>>> Puja
>>>
>>> Sent from my iPhone
>>>
>>>> On Feb 13, 2018, at 3:43 AM, Jorge Machado <[email protected]> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> I just gave Rya a test drive in a Spark job on AWS with 5 Accumulo
>>>> instances. The performance is really slow: I am getting only 2000
>>>> records per second, and each commit to Accumulo takes roughly 20 ms.
>>>>
>>>> Any trick here, or did I miss something?
>>>>
>>>> Here is my code (I changed the code to run on AWS, of course):
>>>>> package template.spark
>>>>>
>>>>> import java.io.File
>>>>>
>>>>> import org.apache.accumulo.core.client.{Connector, ZooKeeperInstance}
>>>>> import org.apache.accumulo.core.client.security.tokens.PasswordToken
>>>>> import org.apache.rya.accumulo.{AccumuloRdfConfiguration, AccumuloRyaDAO}
>>>>> import org.apache.rya.rdftriplestore.{RdfCloudTripleStore, RyaSailRepository}
>>>>> import org.openrdf.model.impl.ValueFactoryImpl
>>>>> import org.openrdf.repository.sail.SailRepositoryConnection
>>>>>
>>>>> object Main extends InitSpark {
>>>>>   def main(args: Array[String]): Unit = {
>>>>>     val dataset = spark.read.parquet("/Users/jorge/Downloads/test-d4852e42712.gz.parquet")
>>>>>     dataset.foreachPartition(itr => {
>>>>>       System.out.println(System.currentTimeMillis() + " Starting to get connector")
>>>>>       val conn = ryaConnection.getConnection
>>>>>       // Turn every cell of every row into one RDF statement.
>>>>>       val statements = itr.flatMap(row => {
>>>>>         row.schema.map(field => {
>>>>>           val value = row.get(row.schema.fieldIndex(field.name))
>>>>>           val subject = ryaConnection.vf.createURI("urn:fdc:gfk.com:19980923:mySubject")
>>>>>           val predicate = ryaConnection.vf.createURI("urn:fdc:gfk.com:" + field.name)
>>>>>           val obj = ryaConnection.vf.createLiteral(value.toString)
>>>>>           ryaConnection.vf.createStatement(subject, predicate, obj)
>>>>>         })
>>>>>       })
>>>>>       System.out.println(System.currentTimeMillis() + " Start writing data")
>>>>>       statements.foreach(conn.add(_))
>>>>>       System.out.println("Finished Partition")
>>>>>       conn.close()
>>>>>     })
>>>>>   }
>>>>>
>>>>>   // One Rya repository per executor JVM: Scala objects are
>>>>>   // initialized lazily, once per JVM, not once per partition.
>>>>>   object ryaConnection {
>>>>>     val vf = new ValueFactoryImpl()
>>>>>     val store = new RdfCloudTripleStore()
>>>>>     val conf = new AccumuloRdfConfiguration()
>>>>>     conf.addResource(new File("/Users/jorge/Downloads/hdp/HDFS_CLIENT-configs/hdfs-site.xml").toURI.toURL)
>>>>>     conf.addResource(new File("/Users/jorge/Downloads/hdp/HDFS_CLIENT-configs/core-site.xml").toURI.toURL)
>>>>>     conf.addResource(new File("/Users/jorge/Downloads/hdp/ACCUMULO_CLIENT-configs/accumulo-site.xml").toURI.toURL)
>>>>>     //conf.addResource(new File("/Users/jorge/Downloads/hdp/ACCUMULO_CLIENT-configs/client.conf").toURI.toURL)
>>>>>     val dao = new AccumuloRyaDAO()
>>>>>     val pass = new PasswordToken("accumulo")
>>>>>     val connector: Connector =
>>>>>       new ZooKeeperInstance("hdp-accumulo-instance", "sandbox-hdp.hortonworks.com:2181")
>>>>>         .getConnector("root", pass)
>>>>>     System.out.println("got Connector")
>>>>>     dao.setConnector(connector)
>>>>>     conf.setTablePrefix("rya_")
>>>>>     dao.setConf(conf)
>>>>>     store.setRyaDAO(dao)
>>>>>     val myRepository = new RyaSailRepository(store)
>>>>>     myRepository.initialize()
>>>>>
>>>>>     def getConnection: SailRepositoryConnection = myRepository.getConnection
>>>>>   }
>>>>> }
>>>>>
>>>>> Jorge
>>>>
>>>> <logs.zip>
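Independent of the flush setting, grouping the adds into explicit transactions should also cut the per-triple commit overhead. A minimal sketch for the foreachPartition body in the code above, assuming a Sesame version where RepositoryConnection exposes begin()/commit(); the batch size of 10000 is an arbitrary example to tune:

    // Sketch: commit statements in batches instead of one implicit
    // transaction per add. begin()/commit() come from the openrdf
    // RepositoryConnection API; statements is the Iterator[Statement]
    // built in the foreachPartition body above.
    val conn = ryaConnection.getConnection
    try {
      statements.grouped(10000).foreach { batch =>
        conn.begin()
        batch.foreach(conn.add(_))
        conn.commit()
      }
    } finally {
      conn.close()
    }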
