I haven't worked with Datasets, but would this help?
https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
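For what it's worth, the core of that linked answer is just an implicit conversion. A minimal sketch, assuming a local SparkSession and an illustrative case class (the names `Record`, `toDataset` are mine, not from the thread):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative row type; the thread does not say what the RDD holds.
case class Record(id: Int, value: String)

object RddToDatasetSketch {
  // Converts an RDD built from `rows` into a typed Dataset via .toDS().
  def toDataset(spark: SparkSession, rows: Seq[Record]): Dataset[Record] = {
    import spark.implicits._ // provides the Encoder and the .toDS() conversion
    spark.sparkContext.parallelize(rows).toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-to-ds").getOrCreate()
    val ds = toDataset(spark, Seq(Record(1, "a"), Record(2, "b")))
    ds.show()
    spark.stop()
  }
}
```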
On Jun 23, 2017 5:43 PM, "Keith Chapman" wrote:
> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = ne[...]

[...]e. Sorry I can't be helpful,
> hopefully someone else will be able to explain exactly how this works.
>
--
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg
[...] in any implementation based on Spark DataFrame.
>
>
> If you are using the "spark.ml" package, then most ML libraries in it are
> based on DataFrame. So you should set "spark.sql.shuffle.partitions"
> instead of "spark.default.parallelism".
>
>
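Concretely, the two settings live in different places. A minimal sketch, assuming a local SparkSession (the value 300 echoes the outputPartitionCount from the earlier mail):

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  // spark.sql.shuffle.partitions controls DataFrame/Dataset shuffles
  // (joins, groupBy); RDD operations fall back to spark.default.parallelism
  // unless a partition count is passed to the operation explicitly.
  def build(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("shuffle-partitions")
      .config("spark.sql.shuffle.partitions", "300")
      .getOrCreate()
}
```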
> Yon[...]
>
> On Wed, 18 Jan, 2017 at 10:16 pm, Saliya Ekanayake
> wrote:
> Thank you for the quick response. No, this is not Spark SQL. I am running
> the built-in PageRank.
>
> On Wed,[...]

Thank you for the quick response. No, this is not Spark SQL. I am running
the built-in PageRank.
On Wed, Jan 18, 2017 at 10:33 AM, wrote:
> Are you talking here of Spark SQL ?
>
> If yes, spark.sql.shuffle.partitions needs to be changed.
>
>
>
> *From:* Saliya E[...]

[...]deterministic way?
Thank you,
Saliya
Just realized the attached file has its text formatting wrong. The GitHub link
to the file is:
https://github.com/esaliya/graphxprimer/blob/master/src/main/scala-2.10/org/saliya/graphxprimer/PregelExample2.scala
On Tue, Nov 22, 2016 at 3:08 PM, Saliya Ekanayake wrote:
> Hi,
>
> I've c[...]

[...]'t clone, Spark would send the same array that it got
after the initial call.
Is there a way to turn off this caching effect?
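I don't know of a switch for this, but the usual workaround is to defensively copy the array before keeping a reference to it. The aliasing itself is plain JVM behavior, so it can be illustrated without Spark (a sketch; `withClone`/`withoutClone` are my names):

```scala
// Plain-Scala illustration of the aliasing problem: if the receiver does
// not clone, later mutations of the shared array are visible through every
// reference to it. In Spark the same effect shows up when a deserialized
// object is reused across calls instead of being rebuilt fresh.
object CloneSketch {
  def withoutClone(shared: Array[Int]): Array[Int] = shared         // keeps an alias
  def withClone(shared: Array[Int]): Array[Int]   = shared.clone()  // independent copy

  def main(args: Array[String]): Unit = {
    val shared  = Array(1, 2, 3)
    val aliased = withoutClone(shared)
    val copied  = withClone(shared)
    shared(0) = 99
    println(aliased(0)) // 99: the alias sees the mutation
    println(copied(0))  // 1: the clone is unaffected
  }
}
```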
Thank you,
Saliya
PregelEx[...]
Hi,
I have created a property graph using GraphX. Each vertex has an integer
array as a property. I'd like to update the values of these arrays without
creating new graph objects.
Is this possible in Spark?
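As far as I know, GraphX graphs are immutable, so the usual pattern is to derive a new graph with `mapVertices` rather than mutate in place. A sketch, assuming spark-graphx is on the classpath (the vertex data and names are illustrative):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object UpdateVertexArraysSketch {
  // Produces a new graph whose vertex arrays are element-wise incremented.
  // mapVertices cannot mutate in place: it builds new vertex attributes,
  // which is why "updating without new graph objects" is not really possible.
  def incrementAll(graph: Graph[Array[Int], Int]): Graph[Array[Int], Int] =
    graph.mapVertices((_, arr) => arr.map(_ + 1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("graphx-update").getOrCreate()
    val sc = spark.sparkContext
    val vertices = sc.parallelize(Seq((1L, Array(0, 0)), (2L, Array(5, 5))))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1)))
    val updated = incrementAll(Graph(vertices, edges))
    updated.vertices.collect().foreach { case (id, arr) => println(s"$id -> ${arr.mkString(",")}") }
    spark.stop()
  }
}
```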
Thank you,
Saliya
[...]lowing similar partitioning on
> both RDDs
>
> On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake
> wrote:
>
>> Thank you, but isn't that join going to be too expensive for this?
>>
>> On Tue, Sep 13, 2016 at 11:55 PM, ayan guha wrote:
>>
>>>
[...](filename, filecontent).
> 3. Join RDD1 and 2 based on some file name (or some other key).
>
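The join-based approach above can be sketched roughly like this, assuming the files are small enough for wholeTextFiles and that each record's first column names its file (paths, parsing, and identifiers are illustrative, not from the thread):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinRecordsWithFilesSketch {
  // Step 3: join (fileName, record) against (fileName, fileContent).
  def joinOnName(records: RDD[(String, String)],
                 files: RDD[(String, String)]): RDD[(String, (String, String))] =
    records.join(files)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-files").getOrCreate()
    val sc = spark.sparkContext

    // Step 1: RDD1 as (fileName, record), keyed by the record's first column.
    val records = sc.textFile("hdfs:///records.txt")
      .map { line => (line.split(",")(0), line) }

    // Step 2: RDD2 as (fileName, fileContent) over the many small files.
    val files = sc.wholeTextFiles("hdfs:///data/")
      .map { case (path, content) => (path.split("/").last, content) }

    joinOnName(records, files).take(5).foreach(println)
    spark.stop()
  }
}
```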
> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake
> wrote:
>
>> 1.) What needs to be parallelized is the work for each of those 6M rows,
>> not the 80K files. Let me elaborate thi[...]

[...]ile has 6M rows, but the total number of files is ~80K. Is
> there a scenario where there may not be a file in HDFS corresponding to the
> row in the first text file?
> 3. May be a follow up of 1, what is your end goal?
>
> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake
> wrote:
>
>
> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" wrote:
>
>> Just wonder if this is possible with Spark?
>>
>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake
>> wrote:
>>
>>> Hi,
>>>
>>> I've got a text file where each line[...]
Just wonder if this is possible with Spark?
On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake
wrote:
> Hi,
>
> I've got a text file where each line is a record. For each record, I need
> to process a file in HDFS.
>
> So if I represent these records as an RDD and invo[...]

[...]ere a better solution to that?
Thank you,
Saliya
--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington