Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Brian Cho
Hi Kay, Thank you for the detailed explanation. If I understand correctly, I *could* time each record's processing by measuring the time in reader.next, but this would add overhead for every single record. And this is the method that was abandoned because of performance regressions. The other…
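A minimal sketch of the sampling idea that keeps per-record timing overhead bounded — wrap the record iterator and time only every Nth call to next(), extrapolating the total. This is illustrative only, not Spark's actual instrumentation; all names below are hypothetical.

// Hypothetical sketch, not Spark's actual instrumentation: timing every
// record with System.nanoTime is the approach that was abandoned for its
// overhead; sampling is one way to bound that cost.
class SampledTimingIterator[T](underlying: Iterator[T], sampleEvery: Int = 100)
    extends Iterator[T] {
  private var calls = 0L
  private var sampledNanos = 0L
  private var sampledCalls = 0L

  override def hasNext: Boolean = underlying.hasNext

  override def next(): T = {
    calls += 1
    if (calls % sampleEvery == 0) {
      val start = System.nanoTime()
      val record = underlying.next()
      sampledNanos += System.nanoTime() - start
      sampledCalls += 1
      record
    } else {
      underlying.next()
    }
  }

  // Rough estimate of total time spent in next(), scaled up from the sample.
  def estimatedNanos: Long =
    if (sampledCalls == 0) 0L else (sampledNanos / sampledCalls) * calls
}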

Re: [build system] short downtime next thursday morning, 5-12-16 @ 8am PDT

2016-05-11 Thread shane knapp
reminder: this is happening tomorrow morning! 7am PDT: builds paused; 8am PDT: master reboot, upgrade happens; 9am PDT: builds restarted. On Mon, May 9, 2016 at 4:17 PM, shane knapp wrote: > reminder: this is happening thursday morning. > > On Wed, May 4, 2016 at 11:38 AM, shane knapp wrote: …

Re: Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Reynold Xin
Adding Kay. On Wed, May 11, 2016 at 12:01 PM, Brian Cho wrote: > Hi, > > I'm interested in adding read-time (from HDFS) to Task Metrics. The > motivation is to help debug performance issues. After some digging, it's > briefly mentioned in SPARK-1683 that this feature didn't make it due to > metric…

Shrinking the DataFrame lineage

2016-05-11 Thread Ulanov, Alexander
Dear Spark developers, Recently, I was trying to switch my code from RDDs to DataFrames in order to compare the performance. The code computes an RDD in a loop. I use RDD.persist followed by RDD.count to force Spark to compute the RDD and cache it, so that it does not need to re-compute it on each iteration…
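For reference, a minimal sketch of the persist-then-count pattern described above, plus one commonly suggested (unofficial) way to truncate a DataFrame's growing logical plan: round-trip through an RDD, which drops the accumulated lineage while keeping the data and schema. Spark 1.6-era APIs are assumed; the data and transformation are illustrative.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    var df: DataFrame = sc.parallelize(1 to 1000).toDF("x")
    for (_ <- 1 to 10) {
      df = df.withColumn("x", $"x" + 1)  // some iterative transformation
      df.persist()
      df.count()                         // force computation and caching

      // Truncate the logical plan: rebuild the DataFrame from its RDD and
      // schema, dropping the lineage built up so far. (Unpersisting the
      // previous iteration's cache is omitted for brevity.)
      df = sqlContext.createDataFrame(df.rdd, df.schema)
    }
  }
}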

Adding HDFS read-time metrics per task (RE: SPARK-1683)

2016-05-11 Thread Brian Cho
Hi, I'm interested in adding read-time (from HDFS) to Task Metrics. The motivation is to help debug performance issues. After some digging, it's briefly mentioned in SPARK-1683 that this feature didn't make it due to metric collection causing a performance regression [1]. I'd like to try tackling…

Re: dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu wrote: > In the master branch, the behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: >> Hi guys, …

Re: dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread Ted Yu
In the master branch, the behavior is the same. Suggest opening a JIRA if you haven't done so. On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote: > Hi guys, > > I have a problem with a Spark DataFrame. My Spark version is 1.6.1. > Basically, I used a udf and df.withColumn to create a "new" column, and then…

dataframe udf function will be executed twice when filtering on a new column created by withColumn

2016-05-11 Thread Tony Jin
Hi guys, I have a problem with a Spark DataFrame. My Spark version is 1.6.1. Basically, I used a udf and df.withColumn to create a "new" column, and then I filter the values on this new column and call show (an action). I see that the udf function (which is used by withColumn to create the new column) is…
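A small sketch (Spark 1.6-style API; names and data are illustrative) that reproduces the reported behavior and shows one workaround often suggested: cache the intermediate DataFrame so the filter runs over materialized values instead of re-invoking the udf.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("udf-twice-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The println side effect makes any re-execution of the udf visible.
val plusOne = udf { x: Long =>
  println(s"plusOne called with $x")
  x + 1
}

val df = sqlContext.range(0, 3).withColumn("plus", plusOne($"id"))

// Filtering on the derived column may print twice per row: the predicate on
// "plus" can be rewritten in terms of the udf and evaluated separately from
// the projection that produces the column.
df.filter($"plus" > 1).show()

// Workaround: materialize the intermediate result first, so the filter runs
// over cached values rather than re-evaluating the udf.
df.persist()
df.count()
df.filter($"plus" > 1).show()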

Re: Structured Streaming with Kafka source/sink

2016-05-11 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTt9XAz651PiG/Adhoc+queries+spark+streaming&subj=Re+Adhoc+queries+on+Spark+2+0+with+Structured+Streaming > On May 11, 2016, at 1:47 AM, Ofir Manor wrote: > > Hi, > I'm trying out Structured Streaming from the current 2.0 branch. > Does the branch…

Structured Streaming with Kafka source/sink

2016-05-11 Thread Ofir Manor
Hi, I'm trying out Structured Streaming from the current 2.0 branch. Does the branch currently support Kafka as either a source or a sink? I couldn't find a specific JIRA or design doc for that in SPARK-8360 or in the examples. Is it still targeted for 2.0? Also, I naively assume it will look similar to…
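A sketch of the Kafka source API as it later shipped in the spark-sql-kafka-0-10 package (Spark 2.0.2 and up; it was not part of the 2.0 branch at the time of this thread, and a Kafka sink arrived later still). Broker and topic names below are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-structured-sketch").getOrCreate()

// Read a Kafka topic as an unbounded DataFrame of
// (key, value, topic, partition, offset, timestamp) rows.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")  // placeholder
  .option("subscribe", "mytopic")                              // placeholder
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write to the console sink for inspection.
val query = stream.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()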