Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Brian Cho
Hi Kay, Thank you for the detailed explanation. If I understand correctly, I *could* time each record's processing by measuring the time in reader.next, but this would add overhead for every single record. And this is the method that was abandoned because of performance regressions. The other…
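A minimal sketch of the sampling idea that keeps per-record timing overhead bounded — wrap the record iterator and time only every Nth call to next(), extrapolating the total. This is illustrative only, not Spark's actual instrumentation; all names below are hypothetical.

// Hypothetical sketch, not Spark's actual instrumentation: timing every
// record with System.nanoTime is the approach that was abandoned for its
// overhead; sampling is one way to bound that cost.
class SampledTimingIterator[T](underlying: Iterator[T], sampleEvery: Int = 100)
    extends Iterator[T] {
  private var calls = 0L
  private var sampledNanos = 0L
  private var sampledCalls = 0L

  override def hasNext: Boolean = underlying.hasNext

  override def next(): T = {
    calls += 1
    if (calls % sampleEvery == 0) {
      val start = System.nanoTime()
      val record = underlying.next()
      sampledNanos += System.nanoTime() - start
      sampledCalls += 1
      record
    } else {
      underlying.next()
    }
  }

  // Rough estimate of total time spent in next(), scaled up from the sample.
  def estimatedNanos: Long =
    if (sampledCalls == 0) 0L else (sampledNanos / sampledCalls) * calls
}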

Re: [build system] short downtime next thursday morning, 5-12-16 @ 8am PDT

2016-05-11 Thread shane knapp
reminder: this is happening tomorrow morning! 7am PDT: builds paused; 8am PDT: master reboot, upgrade happens; 9am PDT: builds restarted. On Mon, May 9, 2016 at 4:17 PM, shane knapp wrote: > reminder: this is happening thursday morning. > > On Wed, May 4, 2016 at 11:38 AM, shane knapp wrote: …

Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Reynold Xin
Adding Kay. On Wed, May 11, 2016 at 12:01 PM, Brian Cho wrote: > Hi, > > I'm interested in adding read-time (from HDFS) to Task Metrics. The > motivation is to help debug performance issues. After some digging, it's > briefly mentioned in SPARK-1683 that this feature didn't make it due to > metric…

Shrinking the DataFrame lineage

2016-05-11 Thread Ulanov, Alexander
Dear Spark developers, Recently, I was trying to switch my code from RDDs to DataFrames in order to compare the performance. The code computes an RDD in a loop. I use RDD.persist followed by RDD.count to force Spark to compute the RDD and cache it, so that it does not need to re-compute it on each iteration…
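For reference, a minimal sketch of the persist-then-count pattern described above, plus one commonly suggested (unofficial) way to truncate a DataFrame's growing logical plan: round-trip through an RDD, which drops the accumulated lineage while keeping the data and schema. Spark 1.6-era APIs are assumed; the data and transformation are illustrative.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    var df: DataFrame = sc.parallelize(1 to 1000).toDF("x")
    for (_ <- 1 to 10) {
      df = df.withColumn("x", $"x" + 1)  // some iterative transformation
      df.persist()
      df.count()                         // force computation and caching

      // Truncate the logical plan: rebuild the DataFrame from its RDD and
      // schema, dropping the lineage built up so far. (Unpersisting the
      // previous iteration's cache is omitted for brevity.)
      df = sqlContext.createDataFrame(df.rdd, df.schema)
    }
  }
}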

Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Brian Cho
Hi, I'm interested in adding read-time (from HDFS) to Task Metrics. The motivation is to help debug performance issues. After some digging, it's briefly mentioned in SPARK-1683 that this feature didn't make it due to metric collection causing a performance regression [1]. I'd like to try tackling…

Re: dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In the master branch, the behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: >> Hi guys, …

Re: dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread Ted Yu
In the master branch, the behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem with a Spark DataFrame. My Spark version is 1.6.1. > Basically, I used a udf and df.withColumn to create a "new" column, and then…

dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread Tony Jin
Hi guys, I have a problem with a Spark DataFrame. My Spark version is 1.6.1. Basically, I used a udf and df.withColumn to create a "new" column, and then I filter the values on this new column and call show (an action). I see that the udf function (which is used by withColumn to create the new column) is…
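A small sketch (Spark 1.6-style API; names and data are illustrative) that reproduces the reported behavior and shows one workaround often suggested: cache the intermediate DataFrame so the filter runs over materialized values instead of re-invoking the udf.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("udf-twice-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The println side effect makes any re-execution of the udf visible.
val plusOne = udf { x: Long =>
  println(s"plusOne called with $x")
  x + 1
}

val df = sqlContext.range(0, 3).withColumn("plus", plusOne($"id"))

// Filtering on the derived column may print twice per row: the predicate on
// "plus" can be rewritten in terms of the udf and evaluated separately from
// the projection that produces the column.
df.filter($"plus" > 1).show()

// Workaround: materialize the intermediate result first, so the filter runs
// over cached values rather than re-evaluating the udf.
df.persist()
df.count()
df.filter($"plus" > 1).show()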

Re: Structured Streaming with Kafka source/sink

2016-05-11 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTt9XAz651PiG/Adhoc+queries+spark+streaming&subj=Re+Adhoc+queries+on+Spark+2+0+with+Structured+Streaming > On May 11, 2016, at 1:47 AM, Ofir Manor wrote: > > Hi, > I'm trying out Structured Streaming from the current 2.0 branch. > Does the branch…

Structured Streaming with Kafka source/sink

2016-05-11 Thread Ofir Manor
Hi, I'm trying out Structured Streaming from the current 2.0 branch. Does the branch currently support Kafka as either a source or a sink? I couldn't find a specific JIRA or design doc for that in SPARK-8360 or in the examples. Is it still targeted for 2.0? Also, I naively assume it will look similar to…
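A sketch of the Kafka source API as it later shipped in the spark-sql-kafka-0-10 package (Spark 2.0.2 and up; it was not part of the 2.0 branch at the time of this thread, and a Kafka sink arrived later still). Broker and topic names below are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-structured-sketch").getOrCreate()

// Read a Kafka topic as an unbounded DataFrame of
// (key, value, topic, partition, offset, timestamp) rows.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")  // placeholder
  .option("subscribe", "mytopic")                              // placeholder
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write to the console sink for inspection.
val query = stream.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()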