Just wondering if this is what you are implying Ryan (example only):
val data = (dataset to be partitioned)
val splitCondition =
  s"""
  CASE
    WHEN …. THEN ….
    WHEN …. THEN ….
  END partition_condition
  """
val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))
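To make the sketch concrete, here is a self-contained version with a stand-in CASE expression; the data, bucket names, and output path are invented for illustration only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("partition-sketch").getOrCreate()
    import spark.implicits._

    // Toy data standing in for "(dataset to be partitioned)"
    val data = Seq(5, 50, 500).toDF("value")

    // A concrete stand-in for the elided CASE expression
    val splitCondition =
      """CASE
        |  WHEN value < 10 THEN 'small'
        |  WHEN value < 100 THEN 'medium'
        |  ELSE 'large'
        |END""".stripMargin

    val partitionedData = data.withColumn("partitionColumn", expr(splitCondition))

    // The derived column can then drive a partitioned write
    partitionedData.write
      .mode("overwrite")
      .partitionBy("partitionColumn")
      .parquet("/tmp/partition_sketch")

    spark.stop()
  }
}
```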
Thank you Ryan and Xiao – sharing all this info really gives very good insight!
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Monday, December 3, 2018 at 12:05 PM
To: "Thakrar, Jayesh"
Cc: Xiao Li , Spark Dev List
Subject: Re: DataSourceV2 community sync #3
Ja
To: "Thakrar, Jayesh"
Cc: Ryan Blue , "u...@spark.apache.org"
Subject: Re: DataSourceV2 community sync #3
Hi, Jayesh,
This is a good question. Spark is a unified analytics engine for various data
sources. We are able to get the table schema from the underlying data sources
Just curious on the need for a catalog within Spark.
So Spark interfaces with other systems – many of which have a catalog of their
own – e.g. RDBMSes, HBase, Cassandra, etc. – and some of which don't (e.g. HDFS,
plain filesystems, etc.).
So what is the purpose of having this catalog within Spark for tables defin
Thank you for the quick reply Dongjoon.
This sounds interesting and it might be the resolution for our issue.
Let me do some tests and I will update the thread.
Thanks,
Jayesh
From: Dongjoon Hyun
Date: Wednesday, November 21, 2018 at 11:46 AM
To: "Thakrar, Jayesh"
Cc: dev
Subject:
Hi All,
We have some batch processing where we read 100s of thousands of ORC files.
What I found is that this was taking too much time AND that there was a long
pause between the point the read begins in the code and the executors get into
action.
That period is about 1.5+ hours where only the d
Date: Wednesday, September 19, 2018 at 4:35 PM
To: "Thakrar, Jayesh"
Cc: "tigerqu...@outlook.com" , Spark Dev List
Subject: Re: [Discuss] Datasource v2 support for manipulating partitions
What does partition management look like in those systems and what are the
optio
platform.
Similar to the static/dynamic partition loading in Hive and Oracle.
So in short, I agree that partition management should be an optional interface.
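For illustration, an optional partition-management capability could be modeled as a mix-in trait that sources opt into. These trait and method names are hypothetical, not the actual DataSourceV2 interfaces:

```scala
// Illustrative only: an *optional* partition-management mix-in.
trait Table {
  def name: String
}

trait SupportsPartitionManagement extends Table {
  def createPartition(spec: Map[String, String]): Unit
  def dropPartition(spec: Map[String, String]): Unit
  def listPartitions(): Seq[Map[String, String]]
}

// A source that cannot manage partitions simply does not mix the trait in;
// the engine would pattern-match before exposing the operations.
def describe(table: Table): String = table match {
  case p: SupportsPartitionManagement =>
    s"${table.name}: ${p.listPartitions().size} partition(s)"
  case _ =>
    s"${table.name}: partition management not supported"
}
```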
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Wednesday, September 19, 2018 at 2:58 PM
To: "Thakrar, Jayesh
Thanks for the info Ryan – very helpful!
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Wednesday, September 19, 2018 at 3:17 PM
To: "Thakrar, Jayesh"
Cc: Wenchen Fan , Hyukjin Kwon ,
Spark Dev List
Subject: Re: data source api v2 refactoring
Hi Jayesh,
The exis
Totally agree with you Dale that there are situations where, for efficiency,
performance, and better control/visibility/manageability, we need to expose
partition management.
So as described, I suggested two things - the ability to do it in the current
V2 API form via options and appropriate imple
I am not involved with the design or development of the V2 API - so these could
be naïve comments/thoughts.
Just as Dataset abstracts away RDD, which otherwise required a little more
intimate knowledge of Spark internals, I am guessing the absence of
partition operations is either d
connector may not be good and flexible
enough.
From: Russell Spitzer
Date: Tuesday, September 11, 2018 at 9:58 AM
To: "Thakrar, Jayesh"
Cc: Arun Mahadevan , Jungtaek Lim ,
Wenchen Fan , Reynold Xin , Ross
Lawley , Ryan Blue , dev
, "dbis...@us.ibm.com"
Subject: Re: Dat
So if Spark and the destination datastore are both non-transactional, you will
have to resort to an external mechanism for “transactionality”.
Here are some options for both RDBMS and non-transactional datastore destinations.
For now assuming that Spark is used in batch mode (and not streaming mode)
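One such external mechanism, sketched under the assumption of an RDBMS destination that supports atomic RENAME (table names are illustrative): Spark writes into a staging table, and a short transaction publishes it in one visible step.

```scala
// "External transactionality" via a staging table: Spark loads
// `target_staging`; the swap below is the only step readers observe.
// If the Spark job failed earlier, the staging table is dropped and retried.
object StagedLoad {
  // Build the publish statements; execute them in a single DB transaction
  // over plain JDBC (autocommit off). Table names are placeholders.
  def publishStatements(target: String, staging: String): Seq[String] = Seq(
    s"DROP TABLE IF EXISTS ${target}_old",
    s"ALTER TABLE $target RENAME TO ${target}_old",
    s"ALTER TABLE $staging RENAME TO $target"
  )
}
```

Running the three statements inside one transaction (or a single atomic swap where the database offers it) is what makes the load appear all-or-nothing to readers.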
Ryan et al,
Wondering if the existing Spark based data sources (e.g. for HDFS, Kafka) have
been ported to V2.
I remember reading threads where there were discussions about the
inefficiency/overhead of converting from Row to InternalRow that was preventing
certain porting effort etc.
I ask beca
data. I have lots of tables to transport from one environment to the other
and I don’t want to create unnecessary load on the DB.
On 7/12/18, 10:09 AM, "Thakrar, Jayesh"
wrote:
One option is to use plain JDBC to interrogate Postgresql catalog for
the source tab
One option is to use plain JDBC to interrogate Postgresql catalog for the
source table and generate the DDL to create the destination table.
Then using plain JDBC again, create the table at the destination.
See the link below for some pointers…..
https://stackoverflow.com/questions/2593803/how-t
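A rough sketch of that approach (connection details are placeholders, and the type handling is deliberately naive – a real version would map serials, precision, and nullability):

```scala
import java.sql.DriverManager

// Read column metadata from PostgreSQL's information_schema over plain
// JDBC, then assemble a CREATE TABLE statement for the destination.
object DdlFromCatalog {
  // Pure assembly step: (name, type) pairs -> DDL text
  def buildDdl(table: String, cols: Seq[(String, String)]): String =
    cols.map { case (n, t) => s"$n $t" }
      .mkString(s"CREATE TABLE $table (", ", ", ")")

  // Catalog interrogation; jdbcUrl is a placeholder such as
  // "jdbc:postgresql://host/db?user=...&password=..."
  def columnsOf(jdbcUrl: String, table: String): Seq[(String, String)] = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val ps = conn.prepareStatement(
        """SELECT column_name, data_type
          |FROM information_schema.columns
          |WHERE table_name = ?
          |ORDER BY ordinal_position""".stripMargin)
      ps.setString(1, table)
      val rs = ps.executeQuery()
      val buf = scala.collection.mutable.Buffer.empty[(String, String)]
      while (rs.next()) buf += ((rs.getString(1), rs.getString(2)))
      buf.toSeq
    } finally conn.close()
  }
}
```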
From: Joseph Torres
Sent: Tuesday, May 1, 2018 1:58:54 PM
To: Ryan Blue
Cc: Thakrar, Jayesh; dev@spark.apache.org
Subject: Re: Datasource API V2 and checkpointing
I agree that Spark should fully handle state serialization and recovery for
most sources. This is how
Thanks Joseph!
From: Joseph Torres
Date: Friday, April 27, 2018 at 11:23 AM
To: "Thakrar, Jayesh"
Cc: "dev@spark.apache.org"
Subject: Re: Datasource API V2 and checkpointing
The precise interactions with the DataSourceV2 API haven't yet been hammered
out in desig
Wondering if this issue is related to SPARK-23323?
Any pointers will be greatly appreciated….
Thanks,
Jayesh
From: "Thakrar, Jayesh"
Date: Monday, April 23, 2018 at 9:49 PM
To: "dev@spark.apache.org"
Subject: Datasource API V2 and checkpointing
I was wondering when chec
I was wondering when checkpointing is enabled, who does the actual work?
The streaming datasource or the execution engine/driver?
I have written a small/trivial datasource that just generates strings.
After enabling checkpointing, I do see a folder being created under the
checkpoint folder, but t
Thanks Sameer!
From: Sameer Agarwal
Date: Sunday, April 15, 2018 at 10:02 PM
To: "Thakrar, Jayesh"
Cc: "dev@spark.apache.org" , Hyukjin Kwon
Subject: Re: V2.3 Scala API to Github Links Incorrect
[+Hyukjin]
Thanks for flagging this Jayesh.
https://github.com/apache/spark
In browsing through the API docs, the links to Github source code seem to be
pointing to a dev branch rather than the release branch.
Here's one example
Go to the API doc page below and click on the "ProcessingTime.scala" link which
points to Sameer's dev branch.
http://spark.apache.org/docs/lat
Thank you Jose for the quick reply!
I have made myself a watcher on them.
From: Joseph Torres
Date: Friday, April 6, 2018 at 10:41 AM
To: "Thakrar, Jayesh"
Cc: "dev@spark.apache.org"
Subject: Re: Spark 2.3 V2 Datasource API questions
Thanks for trying it out!
We haven
First of all, thank you to the Spark dev team for coming up with the
standardized and intuitive API interfaces.
I am sure it will encourage a lot more new datasource integrations.
I have been playing with the API and have some questions on the
continuous streaming API
(see htt
mentioned and that requires the data sources to generate
RDD[Row] which is what I was looking for.
And yes, I understand that the API is still in flux and subject to change :)
Thanks again to both you and Ryan!
Jayesh
From: Wenchen Fan
Date: Thursday, March 22, 2018 at 6:59 PM
To: "Thakrar, Jayesh
g" to true.
But I should confess that I don't know the source code very well, so will
appreciate if you can point me to any other pointers/examples please.
From: Wenchen Fan
Date: Thursday, March 22, 2018 at 2:52 PM
To: "Thakrar, Jayesh"
Cc: "rb...@netflix.com" , &
.get(0).asInstanceOf[String])))
sqlContext.internalCreateDataFrame(internalRow, schema, isStreaming = true)
}
}
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Thursday, March 22, 2018 at 1:45 PM
To: "Thakrar, Jayesh"
Cc: "dev@spark.apache.org"
Subject: Re: Any reason for no
Because these are not exposed in the usual API, it's not possible (or at least
difficult) to create custom structured streaming sources.
Consequently, one has to create streaming sources in packages under
org.apache.spark.sql.
Any pointers or info is greatly appreciated.
Is this in spark-shell or a spark-submit job?
If spark-submit job, is it local or cluster?
One reliable way of adding jars is to use the command line option "--jars"
See
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
for more info.
If you add ja
I am trying to create a custom streaming source in Spark 2.3.0 and getting the
following error:
scala> 2018-03-19 17:43:20 ERROR MicroBatchExecution:91 - Query [id =
48bb7a4c-7c66-4ad3-926b-81f8369a6efb, runId =
50800f9b-434d-43df-8d6a-3e0fdc865aeb] terminated with error
java.lang.AssertionErro
val rows = myDataSourcePartition.rowCount
val partitionData = 1 to rows map (r =>
  Row(s"Partition: ${partitionId}, row ${r} of ${rows}"))
partitionData.iterator
}
}
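The row-generation logic in the fragment above can be sketched standalone, with a simple case class standing in for Spark's Row:

```scala
// Stand-in for org.apache.spark.sql.Row, so the logic runs without Spark
case class MyRow(value: String)

// One synthetic row per record for a given partition, mirroring the
// fragment above (partitionId and rowCount are supplied by the source).
def partitionRows(partitionId: Int, rowCount: Int): Iterator[MyRow] =
  (1 to rowCount).iterator.map { r =>
    MyRow(s"Partition: ${partitionId}, row ${r} of ${rowCount}")
  }
```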
From: Wenchen Fan
Date: Wednesday, February 28, 2018 at 12:25 PM
To: "Thakrar, Jayesh"
Cc: "dev@
Hi All,
I was just toying with creating a very rudimentary RDD datasource to understand
the inner workings of RDDs.
It seems that one of the constructors for RDD has a parameter of type
SparkContext, but it (apparently) exists on the driver only and is not
serializable.
Consequently, any atte
Hi All,
I am trying to "test" a very simple custom receiver and am a little puzzled.
Using Spark 2.2.0 shell on my laptop, I am running the code below.
I was expecting the code to timeout since my timeout wait period is 1 ms and I
have a sleep in the class that is much more (1200 ms).
Is this n
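The likely explanation is that a timed wait returns without stopping the work in progress; the same behavior can be reproduced with a plain thread, independent of Spark:

```scala
// Waiting 1 ms on work that sleeps ~1200 ms returns before the work is
// done, and does NOT cancel it - the same semantics as
// awaitTerminationOrTimeout on a streaming context.
val worker = new Thread(() => Thread.sleep(1200))
worker.start()
worker.join(1)                    // wait at most ~1 ms, then give up waiting
val stillRunning = worker.isAlive // true: the sleep is still in progress
```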
this part.
Jayesh, shall I skip the window functions part since you are going to work on
that?
2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh <jthak...@conversantmedia.com>:
I too am interested in expanding the documentation for Spark SQL.
For my work I needed to get some info/exam
I too am interested in expanding the documentation for Spark SQL.
For my work I needed to get some info/examples/guidance on window functions and
have been using
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
.
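For reference, a minimal example of the kind that post covers – ranking within a partition – with invented data and column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

object WindowExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("window-example").getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("east", "alice", 300),
      ("east", "bob", 200),
      ("west", "carol", 400)
    ).toDF("region", "rep", "amount")

    // Rank reps by amount within each region
    val byRegion = Window.partitionBy("region").orderBy($"amount".desc)
    sales.withColumn("rank", rank().over(byRegion)).show()

    spark.stop()
  }
}
```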
How about divide and conquer?
From: Michael
Teng,
Before you go down the path of creating your own custom Spark system, do give
some thought to what Holden and others are suggesting, viz. using implicit methods.
If you want real concrete examples, have a look at the Spark Cassandra
Connector -
Here you will see an example of "extending" SparkContex
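The mechanics of that "extending" pattern, shown with a plain class so it stays self-contained; the class and method names are illustrative, but SparkContext is extended the same way:

```scala
// Stand-in for the class being extended (e.g. SparkContext)
class Context(val appName: String)

object ContextExtensions {
  // The implicit class adds methods that appear to live on Context itself,
  // without modifying Context or subclassing it.
  implicit class RichContext(val ctx: Context) extends AnyVal {
    def describe: String = s"app=${ctx.appName}"
  }
}

// Usage: bring the implicit into scope and call the new method directly
import ContextExtensions._
val description = new Context("demo").describe
```

This is how the Spark Cassandra Connector attaches methods such as its table readers onto SparkContext without Spark knowing anything about Cassandra.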