Both groupByKey and join() accept a Partitioner as a parameter. Maybe you can specify a custom Partitioner so that the amount of shuffle is reduced.
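A minimal sketch of that suggestion (local mode, with placeholder String payloads standing in for the real Vector and InvertedIndex types, which are not reproduced here): passing the same Partitioner to both operations means join can reuse the partitioning that groupByKey already established, so only the smaller inverted-index RDD is shuffled in the second step.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("copartition-sketch").setMaster("local[2]"))

// Placeholder payloads; in the thread these would be Vector and InvertedIndex.
val vectors    = sc.parallelize(Seq((1, "v1"), (1, "v2"), (2, "v3")))
val invIndexes = sc.parallelize(Seq((1, "idx1"), (2, "idx2")))

val part = new HashPartitioner(4)

// groupByKey with an explicit Partitioner: the result is partitioned by `part`.
val grouped = vectors.groupByKey(part)

// Passing the same Partitioner to join lets Spark reuse grouped's layout,
// so only invIndexes is shuffled here.
val tasks = grouped.join(invIndexes, part)

val result = tasks.collect().toMap
result.foreach(println)
sc.stop()
```

The same idea applies when the two RDDs are partitioned up front with `partitionBy(part)`: any subsequent key-based operation between them becomes a narrow dependency.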
On Sat, Jan 16, 2016 at 9:39 AM, Daniel Imberman <daniel.imber...@gmail.com> wrote:

> Hi Ted,
>
> I think I might have figured something out! (Though I haven't tested it at
> scale yet.)
>
> My current thought is that I can do a groupByKey on the RDD of vectors and
> then do a join with the inverted index. It would look something like this:
>
> val invIndexes: RDD[(Int, InvertedIndex)]
> val partitionedVectors: RDD[(Int, Vector)]
>
> val partitionedTasks: RDD[(Int, (Iterable[Vector], InvertedIndex))] =
>   partitionedVectors.groupByKey().join(invIndexes)
>
> val similarities = partitionedTasks.map(/* calculate similarities */)
> val maxSim = similarities.reduce(math.max)
>
> So while I realize that a groupByKey is usually frowned upon, it seems to
> me that since I need all associated vectors to be local anyway, this
> repartitioning would not be too expensive.
>
> Does this seem like a reasonable approach to this problem, or are there
> any faults that I should consider if I approach it this way?
>
> Thank you for your help,
>
> Daniel
>
> On Fri, Jan 15, 2016 at 5:30 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> My knowledge of XSEDE is limited - I visited the website.
>>
>> If there is no easy way to deploy HBase, an alternative approach (using
>> HDFS?) needs to be considered.
>>
>> I need to do more homework on this :-)
>>
>> On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:
>>
>>> Hi Ted,
>>>
>>> So unfortunately, after looking into the cluster manager that I will be
>>> using for my testing (I'm using a supercomputer called XSEDE rather
>>> than AWS), it looks like the cluster does not actually come with HBase
>>> installed (this cluster is becoming somewhat problematic, as it is
>>> essentially AWS but you have to do your own virtualization scripts).
>>> Do you have any other thoughts on how I could go about dealing with
>>> this purely using Spark and HDFS?
>>>
>>> Thank you
>>>
>>> On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:
>>>
>>>> Thank you Ted! That sounds like it would probably be the most
>>>> efficient (with the least overhead) way of handling this situation.
>>>>
>>>> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Another approach is to store the objects in a NoSQL store such as
>>>>> HBase. Looking up objects should be very fast.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <daniel.imber...@gmail.com> wrote:
>>>>>
>>>>>> I'm looking for a way to send structures to pre-determined
>>>>>> partitions so that they can be used by another RDD in a
>>>>>> mapPartitions.
>>>>>>
>>>>>> Essentially I'm given an RDD of SparseVectors and an RDD of
>>>>>> inverted indexes. The inverted index objects are quite large.
>>>>>>
>>>>>> My hope is to do a mapPartitions within the RDD of vectors where I
>>>>>> can compare each vector to the inverted index. The issue is that I
>>>>>> only NEED one inverted index object per partition (which would have
>>>>>> the same key as the values within that partition).
>>>>>>
>>>>>> val vectors: RDD[(Int, SparseVector)]
>>>>>>
>>>>>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>>>>>   a.reduceByKey(generateInvertedIndex)
>>>>>>
>>>>>> vectors.mapPartitions { iter =>
>>>>>>   val invIndex = invertedIndexes(samePartitionKey)
>>>>>>   iter.map(invIndex.calculateSimilarity(_))
>>>>>> }
>>>>>>
>>>>>> How could I go about setting up the partitioning such that the
>>>>>> specific data structure I need will be present for the
>>>>>> mapPartitions, but I won't have the extra overhead of sending over
>>>>>> all values (which would happen if I were to make a broadcast
>>>>>> variable)?
>>>>>>
>>>>>> One thought I have been having is to store the objects in HDFS, but
>>>>>> I'm not sure if that would be a suboptimal solution (it seems like
>>>>>> it could slow down the process a lot).
>>>>>>
>>>>>> Another thought I am currently exploring is whether there is some
>>>>>> way I can create a custom Partition or Partitioner that could hold
>>>>>> the data structure (although that might get too complicated and
>>>>>> become problematic).
>>>>>>
>>>>>> Any thoughts on how I could attack this issue would be highly
>>>>>> appreciated.
>>>>>>
>>>>>> Thank you for your help!
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
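For the original question, one pattern worth sketching (none of these names come from the thread; the `KeyPartitioner` class and the Double payloads standing in for SparseVector and InvertedIndex are assumptions for illustration) is to co-partition the two RDDs with the same Partitioner and then walk matching partitions together with zipPartitions. This keeps exactly the needed inverted-index object local to each vector partition, without broadcasting everything.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// A hypothetical Partitioner that maps each integer key to a fixed
// partition. Note that a Partitioner only decides placement; it should
// not hold the inverted-index data itself, since it gets serialized
// into every task.
class KeyPartitioner(val numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val mod = key.asInstanceOf[Int] % numParts
    if (mod < 0) mod + numParts else mod
  }
}

val sc = new SparkContext(
  new SparkConf().setAppName("zip-partitions-sketch").setMaster("local[2]"))

// Placeholder Doubles stand in for SparseVector and InvertedIndex.
val part = new KeyPartitioner(3)
val vectors  = sc.parallelize(Seq((0, 1.0), (1, 2.0), (2, 3.0), (1, 4.0)))
  .partitionBy(part)
val invIndex = sc.parallelize(Seq((0, 10.0), (1, 20.0), (2, 30.0)))
  .partitionBy(part)

// Both RDDs use the same Partitioner, so matching keys sit in the same
// partition and can be processed together without a further shuffle.
val similarities = vectors.zipPartitions(invIndex) { (vIter, iIter) =>
  val indexByKey = iIter.toMap
  // Stand-in for invIndex.calculateSimilarity(vector).
  vIter.map { case (k, v) => v * indexByKey(k) }
}

val maxSim = similarities.reduce(math.max)
println(maxSim)
sc.stop()
```

zipPartitions requires both RDDs to have the same number of partitions, which the shared Partitioner guarantees; using the same Partitioner also guarantees that keys line up partition-for-partition, so the per-partition `toMap` only ever holds the inverted indexes that this partition's vectors actually need.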