Re: Data partitioning and node tracking in Spark-GraphX

2015-05-17 Thread MUHAMMAD AAMIR
Can you please elaborate on how to fetch the records from a particular
partition (node, in our case)? For example, my RDD is distributed over 10 nodes
and I want to fetch the data of one particular node/partition, i.e. the
partition/node with index 5. How can I do this?
I have tried mapPartitionsWithIndex as well as the partitions.foreach
functions. However, these are expensive. Does anybody know a more efficient
way?

Thanks in anticipation.
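
For what it's worth, a cheaper route than mapPartitionsWithIndex + collect
(which still schedules a task on every partition) is SparkContext.runJob,
which can run on selected partitions only. A minimal sketch, assuming Spark
1.x (later versions drop the allowLocal flag) and a placeholder RDD[String]
named rdd:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Fetch only the elements of partition `index`, scheduling a task on that
// partition alone; the other partitions are never touched.
def fetchPartition(sc: SparkContext, rdd: RDD[String], index: Int): Array[String] =
  sc.runJob(rdd, (iter: Iterator[String]) => iter.toArray, Seq(index), allowLocal = false).head

val partitionFive = fetchPartition(sc, rdd, 5)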


On Thu, Apr 16, 2015 at 5:49 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 Well you can have a two-level index structure, still without any need for
 physical cluster node awareness.



 Level 1 Index is the previously described partitioned [K,V] RDD – this
 gets you to the value (RDD element) you need on the respective cluster node



 Level 2 Index – it will be built and reside within the Value of each [K,V]
 RDD element – so after you retrieve the appropriate element from the
 appropriate cluster node via the Level 1 Index, you query the Value of
 that element via the Level 2 Index.
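
 In code, the shape could look like this (a sketch under stated assumptions:
 LocalIndex is a hypothetical per-element structure, keyedData an existing
 RDD[(Long, LocalIndex)], and someKey a placeholder key):

 import org.apache.spark.HashPartitioner
 import org.apache.spark.rdd.RDD

 // Hypothetical Level 2 structure stored inside each Value.
 case class LocalIndex(points: Vector[(Double, Double, Double)]) {
   def query(x: Double, y: Double, z: Double): Vector[(Double, Double, Double)] =
     points.filter { case (px, py, pz) => px == x && py == y && pz == z } // placeholder predicate
 }

 // Level 1: hash-partition on the key so a lookup touches one partition only.
 val indexed: RDD[(Long, LocalIndex)] =
   keyedData.partitionBy(new HashPartitioner(10)).persist()

 // Level 1 lookup narrows to the owning partition; Level 2 queries the value.
 val hits = indexed.lookup(someKey).flatMap(_.query(1.0, 2.0, 3.0))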



 *From:* MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com]
 *Sent:* Thursday, April 16, 2015 4:32 PM

 *To:* Evo Eftimov
 *Cc:* user@spark.apache.org
 *Subject:* Re: Data partitioning and node tracking in Spark-GraphX



 Thanks a lot for the reply. Indeed it is useful, but to be more precise, I
 have 3D data and want to index it using an octree. Thus I aim to build a
 two-level indexing mechanism: first, at the global level, I want to partition
 the data and send it to the nodes; then, at the node level, I again want to
 use an octree to index my data locally.

 Could you please elaborate on the solution in this context?



 On Thu, Apr 16, 2015 at 5:23 PM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 Well you can use a [Key, Value] RDD and partition it based on a hash
 function of the Key, and even with a specific number of partitions (and hence
 cluster nodes). This will a) index the data, b) divide it and send it to
 multiple nodes. Re your last requirement - in a cluster programming
 environment/framework, your app code should not be concerned with exactly
 which physical node a partition resides on.
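
 A minimal sketch of that (records is a placeholder RDD of (id, payload)
 pairs, e.g. RDD[(Long, String)]; 10 partitions chosen arbitrarily):

 import org.apache.spark.HashPartitioner

 // Key the data, then hash-partition it into a fixed number of partitions.
 val partitioned = records.partitionBy(new HashPartitioner(10)).persist()

 // The owning partition of any key is deterministic and computable anywhere:
 val home = new HashPartitioner(10).getPartition(42L)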



 Regards

 Evo Eftimov



 *From:* MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com]
 *Sent:* Thursday, April 16, 2015 4:20 PM
 *To:* Evo Eftimov
 *Cc:* user@spark.apache.org
 *Subject:* Re: Data partitioning and node tracking in Spark-GraphX



 I want to use Spark functions/APIs to do this task. My basic purpose is to
 index the data, divide it, and send it to multiple nodes. Then, at access
 time, I want to reach the right node and data partition. I don't have any
 clue how to do this.

 Thanks,



 On Thu, Apr 16, 2015 at 5:13 PM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 How do you intend to fetch the required data - from within Spark, or using
 an app / code / module outside Spark?

 -Original Message-
 From: mas [mailto:mas.ha...@gmail.com]
 Sent: Thursday, April 16, 2015 4:08 PM
 To: user@spark.apache.org
 Subject: Data partitioning and node tracking in Spark-GraphX

 I have a big data file and aim to create an index on the data. I want to
 partition the data based on a user-defined function in Spark-GraphX (Scala).
 Further, I want to keep track of the node to which a particular data
 partition is sent and on which it is processed, so I can fetch the required
 data by accessing the right node and data partition.
 How can I achieve this?
 Any help in this regard will be highly appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Data-partitioning-and-node-tracking-in-Spark-GraphX-tp22527.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






 --

 Regards,
 Muhammad Aamir







 --

 Regards,
 Muhammad Aamir






-- 
Regards,
Muhammad Aamir




Re: How does GraphX store the routing table?

2015-04-22 Thread MUHAMMAD AAMIR
Hi Ankur,
Thanks for the answer. However, I still have the following queries.

On Wed, Apr 22, 2015 at 8:39 AM, Ankur Dave ankurd...@gmail.com wrote:

 On Tue, Apr 21, 2015 at 10:39 AM, mas mas.ha...@gmail.com wrote:

 How does GraphX store the routing table? Is it stored on the master node,
 or are chunks of the routing table sent to each partition that maintains
 the record of the vertices and edges at that node?


 The latter: the routing table is stored alongside the vertices, and for
 each vertex it stores the set of edge partitions that reference that
 vertex.
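
 Conceptually, the idea can be pictured like this (a simplified sketch of the
 structure, not GraphX's actual internal classes):

 // Each vertex partition keeps, per vertex, the set of edge partitions that
 // reference it, so vertex data is shipped only where it is actually needed.
 type VertexId = Long
 type RoutingTable = Map[VertexId, Set[Int]] // vertex -> referencing edge partitions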


*Then how does the master node track where (in which partition) a
particular vertex or edge resides?*


*Further, does it mean that to fetch a particular edge we first have to
find its source or destination vertex?*



 If only customized edge partitioning is performed, will the corresponding
 vertices be sent to the same partition or not?


 If I understand correctly, you're asking whether it's possible to colocate
 the vertices with the edges so they don't have to move during replication.
 It's possible to do this in some cases by partitioning each edge based on a
 hash partitioner of its source or destination vertex. GraphX will still do
 replication using a shuffle, but most of the shuffle files should be local
 in this case.
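
 With the public API, one way to approximate this is a partition strategy
 keyed on the source vertex (a sketch, assuming graph is an existing
 Graph[VD, ED]):

 import org.apache.spark.graphx.PartitionStrategy

 // EdgePartition1D assigns each edge by a hash of its source vertex id, so if
 // the vertices are hash-partitioned the same way, most of the replication
 // shuffle files stay local to the node.
 val copartitioned = graph.partitionBy(PartitionStrategy.EdgePartition1D)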

 I tried this a while ago but didn't find a very big improvement for
 PageRank. Ultimately a more general solution would be to unify the vertex
 and edge RDDs by designating one replica for each vertex as the master.
 This would also reduce the storage cost by a factor of (average degree -
 1)/(average degree).


*What exactly do you mean here by designating one replica for each vertex
as the master? How can we perform this?*


 Ankur http://www.ankurdave.com/




-- 
Regards,
Muhammad Aamir




Re: Custom Partitioning Spark

2015-04-21 Thread MUHAMMAD AAMIR
Hi Archit,

Thanks a lot for your reply. I am using rdd.partitions.length to check
the number of partitions; rdd.partitions returns the array of partitions.
I would like to add one more question here: do you have any idea how to get
the objects in each partition? Further, is there any way to figure out
which particular partition an object belongs to?

Thanks,
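
A sketch of both inspections (assuming ck is the partitioned pair RDD from
the snippet quoted below; the key (2.0, 3.0) is just an example):

// Objects grouped per partition -- this collects everything to the driver,
// so it is only sensible for small data or debugging.
val perPartition = ck.glom().collect() // Array[Array[((Double, Double), Long)]]
perPartition.zipWithIndex.foreach { case (elems, i) =>
  println(s"partition $i holds ${elems.length} objects")
}

// The partition a given key belongs to, via the RDD's partitioner.
val idx = ck.partitioner.get.getPartition((2.0, 3.0))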

On Tue, Apr 21, 2015 at 12:16 PM, Archit Thakur archit279tha...@gmail.com
wrote:

 Hi,

 This should work. How are you checking the number of partitions?

 Thanks and Regards,
 Archit Thakur.

 On Mon, Apr 20, 2015 at 7:26 PM, mas mas.ha...@gmail.com wrote:

 Hi,

 I aim to do custom partitioning on a text file. I first convert it into a
 pair RDD and then try to use my custom partitioner. However, somehow it is
 not working. My code snippet is given below.

 val file = sc.textFile(filePath)
 val locLines = file.map(line => line.split("\t")).map(line =>
   ((line(2).toDouble, line(3).toDouble), line(5).toLong))
 // neither HashPartitioner(50) nor CustomPartitioner(50) is working here:
 val ck = locLines.partitionBy(new HashPartitioner(50))

 When reading the file, the textFile method partitions it automatically.
 However, when I explicitly try to partition the new RDD locLines, it
 doesn't appear to do anything, and even the number of partitions is the
 same as the one created by sc.textFile().

 Any help in this regard will be highly appreciated.
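
 For reference, a minimal custom partitioner of the shape this snippet
 expects (a sketch -- the original CustomPartitioner was not posted, and the
 grid scheme below is a placeholder):

 import org.apache.spark.Partitioner

 class CustomPartitioner(parts: Int) extends Partitioner {
   override def numPartitions: Int = parts
   override def getPartition(key: Any): Int = key match {
     // keys here are (Double, Double) coordinate pairs; placeholder scheme
     case (x: Double, y: Double) => math.abs((x * 31 + y).toInt) % parts
     case _                      => 0
   }
   override def equals(other: Any): Boolean = other match {
     case p: CustomPartitioner => p.numPartitions == numPartitions
     case _                    => false
   }
   override def hashCode: Int = numPartitions
 }

 Note that partitionBy returns a new RDD: ck.partitions.length should report
 50 even though file.partitions.length stays unchanged.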




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Partitioning-Spark-tp22571.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






-- 
Regards,
Muhammad Aamir




Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
Thanks a lot for the reply. Indeed it is useful, but to be more precise, I
have 3D data and want to index it using an octree. Thus I aim to build a
two-level indexing mechanism: first, at the global level, I want to partition
the data and send it to the nodes; then, at the node level, I again want to
use an octree to index my data locally.
Could you please elaborate on the solution in this context?
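
A sketch of how the two levels could fit together (Point, Octree, and
gridCell are hypothetical stand-ins; Octree.build is a placeholder for a
real local octree constructor):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

case class Point(x: Double, y: Double, z: Double)

// Hypothetical local octree; build() stands in for a real implementation.
class Octree(val points: Seq[Point])
object Octree { def build(ps: Seq[Point]): Octree = new Octree(ps) }

// Level 1 key: a coarse grid cell, so nearby points share a partition key.
def gridCell(p: Point, cell: Double): (Int, Int, Int) =
  ((p.x / cell).toInt, (p.y / cell).toInt, (p.z / cell).toInt)

def buildIndex(points: RDD[Point], cell: Double, parts: Int): RDD[Octree] =
  points
    .map(p => (gridCell(p, cell), p))         // global (Level 1) key
    .partitionBy(new HashPartitioner(parts))  // ship each cell to one node
    .mapPartitions(it => Iterator(Octree.build(it.map(_._2).toSeq))) // local (Level 2) octree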

On Thu, Apr 16, 2015 at 5:23 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 Well you can use a [Key, Value] RDD and partition it based on a hash
 function of the Key, and even with a specific number of partitions (and hence
 cluster nodes). This will a) index the data, b) divide it and send it to
 multiple nodes. Re your last requirement - in a cluster programming
 environment/framework, your app code should not be concerned with exactly
 which physical node a partition resides on.



 Regards

 Evo Eftimov



 *From:* MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com]
 *Sent:* Thursday, April 16, 2015 4:20 PM
 *To:* Evo Eftimov
 *Cc:* user@spark.apache.org
 *Subject:* Re: Data partitioning and node tracking in Spark-GraphX



 I want to use Spark functions/APIs to do this task. My basic purpose is to
 index the data, divide it, and send it to multiple nodes. Then, at access
 time, I want to reach the right node and data partition. I don't have any
 clue how to do this.

 Thanks,



 On Thu, Apr 16, 2015 at 5:13 PM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 How do you intend to fetch the required data - from within Spark, or using
 an app / code / module outside Spark?

 -Original Message-
 From: mas [mailto:mas.ha...@gmail.com]
 Sent: Thursday, April 16, 2015 4:08 PM
 To: user@spark.apache.org
 Subject: Data partitioning and node tracking in Spark-GraphX

 I have a big data file and aim to create an index on the data. I want to
 partition the data based on a user-defined function in Spark-GraphX (Scala).
 Further, I want to keep track of the node to which a particular data
 partition is sent and on which it is processed, so I can fetch the required
 data by accessing the right node and data partition.
 How can I achieve this?
 Any help in this regard will be highly appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Data-partitioning-and-node-tracking-in-Spark-GraphX-tp22527.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






 --

 Regards,
 Muhammad Aamir






-- 
Regards,
Muhammad Aamir




Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
I want to use Spark functions/APIs to do this task. My basic purpose is to
index the data, divide it, and send it to multiple nodes. Then, at access
time, I want to reach the right node and data partition. I don't have any
clue how to do this.
Thanks,

On Thu, Apr 16, 2015 at 5:13 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 How do you intend to fetch the required data - from within Spark, or using
 an app / code / module outside Spark?

 -Original Message-
 From: mas [mailto:mas.ha...@gmail.com]
 Sent: Thursday, April 16, 2015 4:08 PM
 To: user@spark.apache.org
 Subject: Data partitioning and node tracking in Spark-GraphX

 I have a big data file and aim to create an index on the data. I want to
 partition the data based on a user-defined function in Spark-GraphX (Scala).
 Further, I want to keep track of the node to which a particular data
 partition is sent and on which it is processed, so I can fetch the required
 data by accessing the right node and data partition.
 How can I achieve this?
 Any help in this regard will be highly appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Data-partitioning-and-node-tracking-in-Spark-GraphX-tp22527.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






-- 
Regards,
Muhammad Aamir




Re: Incrementally load big RDD file into Memory

2015-04-09 Thread MUHAMMAD AAMIR
Hi,

Thanks a lot for such a detailed response.

On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel guillaume.pi...@exensa.com
wrote:

  Hi Muhammad,

 There are lots of ways to do it. My company actually develops a text
 mining solution which embeds a very fast approximate-neighbours solution (a
 demo with real-time queries on the Wikipedia dataset can be seen at
 wikinsights.org). For the record, we now prepare a dataset of 4.5 million
 documents for querying in about 2 or 3 minutes on a 32-core cluster, and
 the queries take less than 10 ms when the dataset is in memory.

 But if you just want to precompute everything and don't mind waiting a few
 tens of minutes (or hours), and don't want to bother with an approximate
 neighbour solution, then the best way is probably something like this:

 1 - block your data (i.e. group your items in X large groups). Instead of
 a dataset of N elements, you should now have a dataset of X blocks
 containing N/X elements each.
 2 - do the cartesian product (instead of N*N elements, you now have just
 X*X blocks, which should take less memory)
 3 - for each pair of blocks (blockA, blockB), compute the distances between
 each element of blockA and each element of blockB, but keep only the top K
 best for each element of blockA. The output is
 List((elementOfBlockA, listOfKNearestElementsOfBlockBWithTheDistance), ...)
 4 - reduceByKey (the key is the elementOfBlockA), by merging the
 listOfNearestElements and always keeping the K nearest.

 This is an exact version of top K. It is only interesting if K << N/X,
 but even if K is large, it may still fit your needs.
 Remember that you will still compute N*N distances (this is the problem
 with exact nearest neighbours); the only difference with what you're doing
 now is that you produce fewer items and duplicate less data. Indeed, if
 one of your elements takes 100 bytes, the per-element cartesian will produce
 N*N*100*2 bytes, while the blocked version will produce X*X*100*2*N/X, i.e.
 X*N*100*2 bytes.
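
 A sketch of steps 1-4 under simple assumptions (2D points keyed by a Long
 id, blocking by id modulo X, plain Euclidean distance):

 import org.apache.spark.rdd.RDD

 type Elem = (Long, (Double, Double))

 def dist(a: (Double, Double), b: (Double, Double)): Double =
   math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

 def topK(points: RDD[Elem], x: Int, k: Int): RDD[(Long, Seq[(Long, Double)])] = {
   // 1 - block the data into X groups
   val blocks: RDD[Seq[Elem]] =
     points.map(e => (e._1 % x, e)).groupByKey().map(_._2.toSeq)
   // 2 - cartesian product of the blocks, not of the elements
   blocks.cartesian(blocks)
     // 3 - per block pair, keep only the k nearest of blockB per element of blockA
     .flatMap { case (blockA, blockB) =>
       blockA.map { case (idA, pA) =>
         val nearest = blockB
           .collect { case (idB, pB) if idB != idA => (idB, dist(pA, pB)) }
           .sortBy(_._2).take(k)
         (idA, nearest)
       }
     }
     // 4 - merge the per-block candidate lists, always keeping the k nearest
     .reduceByKey((l, r) => (l ++ r).sortBy(_._2).take(k))
 }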

 Guillaume

 Hi Guillaume,

 Thanks for your reply. Can you please tell me how I can improve this for
 top-K nearest points?

 P.S. My post was not accepted on the list; that's why I am emailing you
 here. I would be really grateful if you could reply.
 Thanks,

 On Wed, Apr 8, 2015 at 1:23 PM, Guillaume Pitel 
 guillaume.pi...@exensa.com wrote:

 This kind of operation is not scalable, no matter what you do, at
 least if you _really_ want to do that.

 However, if what you're looking for is not to really compute all
 distances (for instance, if you're looking only for the top K nearest
 points), then it can be highly improved.

 It all depends of what you want to do eventually.

 Guillaume

 val locations = filelines.map(line => line.split("\t")).map(t =>
   (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct()
 // note: no .collect() here -- cartesian needs an RDD, not a local array

 val cartesienProduct = locations.cartesian(locations).map(t =>
   Edge(t._1._1, t._2._1,
     distanceAmongPoints(t._1._2._1, t._1._2._2, t._2._2._1, t._2._2._2)))

 The code executes perfectly fine up till here, but when I try to use
 cartesienProduct it gets stuck, i.e.

 val count = cartesienProduct.count()

 Any help to do this efficiently will be highly appreciated.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Incremently-load-big-RDD-file-into-Memory-tp22410.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




 --
  *Guillaume PITEL, Président*
 +33(0)626 222 431

 eXenSa S.A.S. http://www.exensa.com/
  41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705




  --
  Regards,
 Muhammad Aamir





 --
  *Guillaume PITEL, Président*
 +33(0)626 222 431

 eXenSa S.A.S. http://www.exensa.com/
  41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705




-- 
Regards,
Muhammad Aamir

