Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-21 Thread Wenchen Fan
IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will scan all table files only once, and write the inferred schema back to the metastore so that we don't need to do the schema inference again. So technically this will introduce a performance regression for the first query, but

[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
Hi all, Following the SPIP process, I'm putting this SPIP up for a vote. The current data source API doesn't work well because of some limitations like: no partitioning/bucketing support, no columnar read, hard to support more operator push down, etc. I'm proposing a Data Source API V2 to

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
adding my own +1 (binding) On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi all, > > Following the SPIP process, I'm putting this SPIP up for a vote. > > The current data source API doesn't work well because of some limitations > like: n

Re: How to tune the performance of Tpch query5 within Spark

2017-07-14 Thread Wenchen Fan
Try to replace your UDF with Spark built-in expressions; it should be as simple as `$"x" * (lit(1) - $"y")`. > On 14 Jul 2017, at 5:46 PM, 163 wrote: > > I modified the tpch query5 to DataFrame: > val forders = >
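The rewrite Wenchen suggests can be sketched as follows. This is a minimal sketch: the DataFrame `df` and the columns `x`/`y` are assumed stand-ins for the TPC-H revenue expression, not the poster's actual query.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq((10.0, 0.1), (20.0, 0.2)).toDF("x", "y")

// UDF version: a black box to Catalyst, with per-row (de)serialization cost.
val revenueUdf = udf((x: Double, y: Double) => x * (1 - y))
val slow = df.select(revenueUdf($"x", $"y").as("revenue"))

// Built-in expression version: stays inside whole-stage codegen,
// so the optimizer can see and optimize it.
val fast = df.select(($"x" * (lit(1) - $"y")).as("revenue"))
```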

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Wenchen Fan
It’s not about case when, but about rand(). Non-deterministic expressions are not allowed in join condition. > On 13 Jul 2017, at 6:43 PM, wangshuang wrote: > > I'm trying to execute hive sql on spark sql (Also on spark thriftserver), For > optimizing data skew, we use "case
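A sketch of the restriction and a common workaround; the DataFrames `left` and `right` and the key column `k` are hypothetical:

```scala
import org.apache.spark.sql.functions.rand

// Fails at analysis time: rand() is non-deterministic, and Spark rejects
// non-deterministic expressions inside a join condition.
// left.join(right, left("k") === right("k") && rand() < 0.5)

// Workaround for skew handling: materialize the random value into a column
// first, so the join condition only references deterministic columns.
val saltedLeft = left.withColumn("r", rand())
val ok = saltedLeft.join(right, saltedLeft("k") === right("k") && saltedLeft("r") < 0.5)
```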

Re: [SQL] Return Type of Round Func

2017-07-04 Thread Wenchen Fan
Hive compatibility is not a strong requirement for Spark SQL, and for round, SQLServer also returns the same type as input, see https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql#return-types

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Wenchen Fan
+1 > On 3 Jul 2017, at 8:22 PM, Nick Pentreath wrote: > > +1 (binding) > > On Mon, 3 Jul 2017 at 11:53 Yanbo Liang > wrote: > +1 > > On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier >

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-24 Thread Wenchen Fan
see https://issues.apache.org/jira/browse/SPARK-19611 On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > Whats the regression this fixed in 2.1 from 2.0? > > On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan <wenc...@databricks.com> > w

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
n't think there has really been any discussion of this api change > yet or at least it hasn't occurred on the jira ticket > > On Thu, Aug 17, 2017 at 8:05 AM Wenchen Fan <cloud0...@gmail.com> wrote: > >> adding my own +1 (binding) >> >> On Thu, Aug 17, 2017 at 9:0

Re: appendix

2017-06-20 Thread Wenchen Fan
you should make hbase a data source (it seems we already have an hbase connector?), create a DataFrame from hbase, and do the join in Spark SQL. > On 21 Jun 2017, at 10:17 AM, sunerhan1...@sina.com wrote: > > Hello, > My scenario is like this: > 1.val df=hivecontext/carboncontex.sql("sql") >
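The suggested approach might look like the sketch below. Note the format name and option key are placeholders, not a real API: they depend entirely on which HBase connector library is on the classpath.

```scala
// Sketch only: "org.apache.hadoop.hbase.spark" and "hbase.table" are assumed
// placeholder names; substitute your connector's actual format and options.
val hbaseDf = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "events")
  .load()

val hiveDf = spark.sql("SELECT id, payload FROM mydb.hive_events")

// The join then runs in Spark SQL instead of being hand-rolled per row,
// so both sides benefit from Catalyst optimizations.
val joined = hiveDf.join(hbaseDf, Seq("id"))
```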

Re: dataframe mappartitions problem

2017-06-20 Thread Wenchen Fan
`Dataset.mapPartitions` takes `func: Iterator[T] => Iterator[U]`, which means Spark needs to deserialize the internal binary format to type `T`, and this deserialization is costly. If you do need to do some hacking, you can use the internal API `Dataset.queryExecution.toRdd.mapPartitions`, which
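Side by side, the two paths look like this. The Dataset `ds`, the element type `Record`, and the function `transform` are hypothetical stand-ins:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Public API: each internal row is deserialized into Record before your
// function runs -- the cost described above.
val mapped = ds.mapPartitions { iter: Iterator[Record] =>
  iter.map(transform)
}

// Internal, unstable API: operate on InternalRow directly and skip the
// deserialization. Not a public contract; it may break across Spark versions.
val raw = ds.queryExecution.toRdd.mapPartitions { iter: Iterator[InternalRow] =>
  iter // process the binary-format rows directly
}
```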

Re: A question about rdd transformation

2017-06-23 Thread Wenchen Fan
The exception message should include the lineage of the un-serializable object, can you post that too? On 23 Jun 2017, at 11:23 AM, Lionel Luffy wrote: add dev list. Who can help on the below question? Thanks & Best Regards, LL -- Forwarded message -- From: Lionel

Re: When will spark 2.0 support dataset python API?

2017-05-31 Thread Wenchen Fan
We tried but didn’t get much benefits from Python Dataset, as Python is dynamic typed and there is not much we can do to optimize running python functions. > On 31 May 2017, at 3:36 AM, Cyanny LIANG wrote: > > Hi, > Since DataSet API has become a common way to process

Re: [VOTE] Apache Spark 2.2.0 (RC3)

2017-06-02 Thread Wenchen Fan
I'm -1 on this. I merged a PR to master/2.2 today and broke the build. I'm really sorry for the trouble; I should not have been so aggressive when merging PRs. The actual reason is some misleading comments in the code and a bug in Spark's testing framework

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-14 Thread Wenchen Fan
That test case is trying to test the backward compatibility of `HiveExternalCatalog`. It downloads official Spark releases and creates tables with them, and then read these tables via the current Spark. About the download link, I just picked it from the Spark website, and this link is the default

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-16 Thread Wenchen Fan
ental convenience. While that may be ok when >>> distributing artifacts, it's more of a problem when actually building and >>> testing artifacts. In the latter case, the download should really only be >>> from an Apache mirror. >>> >>> On Thu, Sep

[discuss] Data Source V2 write path

2017-09-20 Thread Wenchen Fan
Hi all, I want to have some discussion about Data Source V2 write path before starting a voting. The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful: 1. Exposing upper-level API like DataFrame to Data Source API is not good for maintenance. 2. Data

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-16 Thread Wenchen Fan
This vote passes with 3 binding +1 votes, 5 non-binding votes, and no -1 votes. Thanks all! +1 votes (binding): Wenchen Fan Reynold Xin Cheng Liang +1 votes (non-binding): Xiao Li Weichen Xu Vaquar khan Liwei Lin Dongjoon Hyun On Tue, Oct 17, 2017 at 12:30 AM, Dongjoon Hyun <dongjoo

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
This vote passes with 4 binding +1 votes, 10 non-binding votes, one +0 vote, and no -1 votes. Thanks all! +1 votes (binding): Wenchen Fan Herman van Hövell tot Westerflier Michael Armbrust Reynold Xin +1 votes (non-binding): Xiao Li Sameer Agarwal Suresh Thalamati Ryan Blue Xingbo Jiang

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
om:* wangzhenhua (G) <wangzhen...@huawei.com> >>> *Sent:* Friday, September 8, 2017 2:20:07 AM >>> *To:* Dongjoon Hyun; 蒋星博 >>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot >>> Westerflier; Ryan Blue; Spark dev list; Suresh Thala

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-05 Thread Wenchen Fan
+1 on the design and proposed API. One detail I'd like to discuss is the 0-parameter UDF, how we can specify the size hint. This can be done in the PR review though. On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung wrote: > +1 on this and like the suggestion of type in

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Wenchen Fan
te side to a separate SPIP, too, since there > isn't much detail in the proposal and I think we should be more deliberate > with things like schema evolution. > > On Thu, Aug 31, 2017 at 10:33 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> Hi Ryan, >> >> I

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-06 Thread Wenchen Fan
adding my own +1 (binding) On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi all, > > In the previous discussion, we decided to split the read and write path of > data source v2 into 2 SPIPs, and I'm sending this email to call a vote for > Dat

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Wenchen Fan
rdering that'd matter in the current set of pushdowns is limit - > it should always mean the root of the pushded tree. > > > On Fri, Sep 1, 2017 at 3:22 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> > Ideally also getting sort orders _after_ getting filters. >> >>

Re: [discuss] Data Source V2 write path

2017-09-26 Thread Wenchen Fan
ovide more details in options and do CTAS at Spark side. These can be done via options. After catalog federation, hopefully only file format data sources still use this backdoor. On Tue, Sep 26, 2017 at 8:52 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > > I think it is a bad idea to

Re: [discuss] Data Source V2 write path

2017-09-24 Thread Wenchen Fan
ed in the same way > like partitions are in the current format? > > On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> Hi all, >> >> I want to have some discussion about Data Source V2 write path before >> starting a voting.

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
guess that's not terrible. I just don't understand why it is > necessary. > > On Mon, Sep 25, 2017 at 11:26 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> Catalog federation is to publish the Spark catalog API(kind of a data >> source API for metadata), so that Spark

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
t;r...@databricks.com> wrote: > Can there be an explicit create function? > > > On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> I agree it would be a clean approach if data source is only responsible >> to write into an already-con

Re: [discuss] Data Source V2 write path

2017-10-01 Thread Wenchen Fan
so > we can introduce consistent behavior across sources for v2. > > rb > > On Thu, Sep 28, 2017 at 8:49 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> > When this CTAS logical node is turned into a physical plan, the >> relation gets turned into a `Dat

[VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-02 Thread Wenchen Fan
Hi all, After we merge the infrastructure of data source v2 read path, and have some discussion for the write path, now I'm sending this email to call a vote for Data Source v2 write path. The full document of the Data Source API V2 is:

Re: [discuss] Data Source V2 write path

2017-09-28 Thread Wenchen Fan
gt; Comments inline. I've written up what I'm proposing with a bit more >> detail. >> >> On Tue, Sep 26, 2017 at 11:17 AM, Wenchen Fan <cloud0...@gmail.com> >> wrote: >> >>> I'm trying to give a summary: >>> >>> Ideally data

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Wenchen Fan
+1 On Tue, Oct 3, 2017 at 11:00 PM, Kazuaki Ishizaki wrote: > +1 (non-binding) > > I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for > core/sql-core/sql-catalyst/mllib/mllib-local have passed. > > $ java -version > openjdk version "1.8.0_131" >

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
If a table has no metastore > (Hadoop FS tables) then we can just pass the table metadata in when > creating the writer since there is no existence in this case. > > rb > ​ > > On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > >> I agree it w

[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Wenchen Fan
Hi all, It has been almost 2 weeks since I proposed the data source V2 for discussion, and we already got some feedbacks on the JIRA ticket and the prototype PR, so I'd like to call for a vote. The full document of the Data Source API V2 is:

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
hat >>>> the proposal says this: >>>> >>>> Ideally partitioning/bucketing concept should not be exposed in the >>>> Data Source API V2, because they are just techniques for data skipping and >>>> pre-partitioning. However, these 2 concepts are already

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Wenchen Fan
sting datasources leverage the cool Spark > features, and one that lets people who just want to implement basic > features do that - I'd try to include some kind of layering here. I could > probably sketch out something here if that'd be useful? > > James > > On Tue, 29 Aug 2017

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
>>> consider ways to fix that problem instead of carrying the problem forward >>> to Data Source V2. We can solve this by adding a high-level API for DDL and >>> a better write/insert API that works well with it. Clearly, that discussion >>> is independent of the r

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-29 Thread Wenchen Fan
Congratulations, Saisai! > On 29 Aug 2017, at 10:38 PM, Kevin Yu wrote: > > Congratulations, Jerry! > > On Tue, Aug 29, 2017 at 6:35 AM, Meisam Fathi > wrote: > Congratulations, Jerry! > > Thanks, > Meisam > > On

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread Wenchen Fan
, and then I return the parts I can’t handle. >>> >>> I’d prefer in general that this be implemented by passing some kind of >>> query plan to the datasource which enables this kind of replacement. >>> Explicitly don’t want to give the whole query plan - that soun

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Wenchen Fan
this is a good idea because of the following technical reasons. Thanks! On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi all, > > After we merge the infrastructure of data source v2 read path, and have > some discussion for the write path, now I'm

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Wenchen Fan
I'm adding my own +1 (binding). On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan <cloud0...@gmail.com> wrote: > I'm going to update the proposal: for the last point, although the > user-facing API (`df.write.format(...).option(...).mode(...).save()`) > mixes data and metadata

Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-29 Thread Wenchen Fan
+1 On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki wrote: > +1 (non-binding) > > I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for > core/sql-core/sql-catalyst/mllib/mllib-local have passed. > > $ java -version > openjdk version "1.8.0_131" >

Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Wenchen Fan
My 2 cents: 1. when merging NullType with another type, the result should always be that type. 2. when merging StringType with another type, the result should always be StringType. 3. when merging integral types, the priority from high to low: DecimalType, LongType, IntegerType. This is because
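The stated precedence can be modeled in a few lines of plain Scala. This is a toy model of the proposal, not Spark's actual partition-inference implementation:

```scala
// Toy model of the proposed merge precedence for partition column inference.
sealed trait PType
case object NullT    extends PType
case object StringT  extends PType
case object DecimalT extends PType
case object LongT    extends PType
case object IntegerT extends PType

def merge(a: PType, b: PType): PType = (a, b) match {
  case (NullT, t) => t                           // 1. NullType yields to anything
  case (t, NullT) => t
  case (StringT, _) | (_, StringT) => StringT    // 2. StringType always wins
  case (DecimalT, _) | (_, DecimalT) => DecimalT // 3. widest integral type wins
  case (LongT, _) | (_, LongT) => LongT
  case _ => IntegerT
}

// merge(NullT, LongT) == LongT; merge(StringT, LongT) == StringT
```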

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-03 Thread Wenchen Fan
+1. I think this architecture makes a lot of sense to let executors talk to source/sink directly, and bring very low latency. On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen wrote: > +0 simply because I don't feel I know enough to have an opinion. I have no > reason to doubt the

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Wenchen Fan
Data Source V2 is still under development. Ordering reporting is one of the planned features, but it's not done yet, we are still thinking about what the API should be, e.g. we need to include sort order, null first/last and other sorting related properties. On Mon, Dec 4, 2017 at 10:12 PM,

Re: How to persistent database/table created in sparkSession

2017-12-05 Thread Wenchen Fan
Try with `SparkSession.builder().enableHiveSupport` ? On Tue, Dec 5, 2017 at 3:22 PM, 163 wrote: > Hi, > How can I persistent database/table created in spark application? > > object TestPersistentDB { > def main(args:Array[String]): Unit = { >
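A minimal sketch of the suggested fix; the database and table names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// With enableHiveSupport, CREATE DATABASE / CREATE TABLE go through the Hive
// metastore and persist across applications. Without it, Spark uses an
// in-memory catalog that disappears when the application exits.
val spark = SparkSession.builder()
  .appName("persistent-catalog")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
spark.sql("CREATE TABLE IF NOT EXISTS mydb.events (id BIGINT, ts TIMESTAMP)")
```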

Re: Dataset API Question

2017-10-25 Thread Wenchen Fan
It's because of different API design. *RDD.checkpoint* returns void, which means it mutates the RDD state, so you need an *RDD.isCheckpointed* method to check if the RDD is checkpointed. *Dataset.checkpoint* returns a new Dataset, which means there is no isCheckpointed state in Dataset, and
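The contrast in a sketch; `sc` is assumed to be a SparkContext with a checkpoint directory already set, and `ds` a hypothetical Dataset:

```scala
// RDD API: checkpoint() returns Unit and mutates the RDD's state,
// so a separate query method is needed.
val rdd = sc.parallelize(1 to 100)
rdd.checkpoint()   // only marks; materialized by the next action
rdd.count()
assert(rdd.isCheckpointed)

// Dataset API: checkpoint() returns a NEW Dataset backed by the checkpointed
// data; the original is unchanged, so no isCheckpointed flag exists.
val checkpointed = ds.checkpoint()
```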

Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-17 Thread Wenchen Fan
SPARK-22371 turns an error to a warning, so it won't break any existing workloads. Let me backport it to 2.3 so users won't hit this problem in the new release. On Fri, May 18, 2018 at 5:59 AM, Imran Rashid wrote: > I just found

Re: Preventing predicate pushdown

2018-05-15 Thread Wenchen Fan
Applying predicate pushdown is an optimization, and it makes sense to provide configs to turn off certain optimizations. Feel free to create a JIRA. Thanks, Wenchen On Tue, May 15, 2018 at 8:33 PM, Tomasz Gawęda wrote: > Hi, > > while working with JDBC datasource I saw

Re: [VOTE] Spark 2.3.1 (RC2)

2018-05-23 Thread Wenchen Fan
We found a critical bug in Tungsten that can lead to silent data corruption: https://github.com/apache/spark/pull/21311 This is a long-standing bug that dates back to Spark 2.0 (not a regression), but since we are going to release 2.3.1, I think it's a good chance to include this fix. We will also

Re: Time for 2.1.3

2018-06-15 Thread Wenchen Fan
+1 On Fri, Jun 15, 2018 at 7:10 AM, Tom Graves wrote: > +1 for doing a 2.1.3 release. > > Tom > > On Wednesday, June 13, 2018, 7:28:26 AM CDT, Marco Gaido < > marcogaid...@gmail.com> wrote: > > > Yes, you're right Herman. Sorry, my bad. > > Thanks. > Marco > > 2018-06-13 14:01 GMT+02:00 Herman

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Wenchen Fan
+1 On Tue, Jun 5, 2018 at 1:20 AM, Henry Robinson wrote: > +1 > > (I hope there will be a fuller design document to review, since the SPIP > is really light on details). > > On 4 June 2018 at 10:17, Joseph Bradley wrote: > >> +1 >> >> On Sun, Jun 3, 2018 at 9:59 AM, Weichen Xu >> wrote: >>

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-02 Thread Wenchen Fan
+1 On Sun, Jun 3, 2018 at 6:54 AM, Marcelo Vanzin wrote: > If you're building your own Spark, definitely try the hadoop-cloud > profile. Then you don't even need to pull anything at runtime, > everything is already packaged with Spark. > > On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas >

Re: Spark data source resiliency

2018-07-03 Thread Wenchen Fan
a failure in the data reader results in a task failure, and Spark will retry the task for you (IIRC, retrying 3 times before failing the job). Can you check your Spark log and see if the task fails consistently? On Tue, Jul 3, 2018 at 2:17 PM assaf.mendelson wrote: > Hi All, > > I implemented a

Re: Spark data source resiliency

2018-07-03 Thread Wenchen Fan
I believe you are using something like `local[8]` as your Spark master, which can't retry tasks. Please try `local[8, 3]`, which can retry failed tasks 3 times. On Tue, Jul 3, 2018 at 2:42 PM assaf.mendelson wrote: > That is what I expected, however, I did a very simple test (using println >
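The difference in master URLs, sketched:

```scala
import org.apache.spark.sql.SparkSession

// local[8]    -> 8 worker threads, maxFailures = 1: the first task failure
//                fails the whole job, so retry behavior can't be observed.
// local[8, 3] -> 8 worker threads, each task may be attempted up to 3 times,
//                matching cluster-style retry semantics.
val spark = SparkSession.builder()
  .master("local[8, 3]")
  .appName("retry-test")
  .getOrCreate()
```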

Re: AccumulatorV2 vs AccumulableParam (V1)

2018-05-03 Thread Wenchen Fan
Hi Sergey, Thanks for your valuable feedback! For 1: yea this is definitely a bug and I have sent a PR to fix it. For 2: I have left my comments on the JIRA ticket. For 3: I don't quite understand it, can you give some concrete examples? For 4: yea this is a problem, but I think it's not a big

Re: Custom datasource as a wrapper for existing ones?

2018-05-03 Thread Wenchen Fan
Hi Jakub, Yea I think data source would be the most elegant way to solve your problem. Unfortunately in Spark 2.3 the only stable data source API is data source v1, which can't be used to implement high-performance data source. Data source v2 is still a preview version in Spark 2.3 and may change

Re: Why some queries use logical.stats while others analyzed.stats?

2018-01-06 Thread Wenchen Fan
l/ > basicLogicalOperators.scala#L895 > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Stream

Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting, can you include the steps to reproduce this bug? On Tue, Jan 16, 2018 at 7:07 AM, Ted Yu wrote: > Did you include any picture ? > > Looks like the picture didn't go thru. > > Please use third party site. > > Thanks > > Original message

Re: Why Dataset.hint uses logicalPlan (= analyzed not planWithBarrier)?

2018-01-26 Thread Wenchen Fan
Looks like we missed this one, feel free to submit a patch, thanks for your finding! On Fri, Jan 26, 2018 at 3:39 PM, Jacek Laskowski wrote: > Hi, > > I've just noticed that every time Dataset.hint is used it triggers > execution of logical commands, their unions and hint

Re: Distinct on Map data type -- SPARK-19893

2018-01-12 Thread Wenchen Fan
Actually Spark 2.1.0 doesn't work for your case; it may give you a wrong result... We are still working on adding this feature, but before that, we should fail earlier instead of returning a wrong result. On Sat, Jan 13, 2018 at 11:02 AM, ckhari4u wrote: > I see SPARK-19893 is

Re: ClassNotFoundException while running unit test with local cluster mode in Intellij IDEA

2018-01-30 Thread Wenchen Fan
You can run the test in SBT and attach your IDEA debugger to it, which works for me. On Tue, Jan 30, 2018 at 7:44 PM, wuyi wrote: > Dear devs, > I've got stuck on this issue for several days, and I need help now. > At first, I ran into an old issue, which is the

Re: [SQL] [Suggestion] Add top() to Dataset

2018-01-30 Thread Wenchen Fan
You can use `Dataset.limit`, which return a new `Dataset` instead of an Array. Then you can transform it and still get the top k optimization from Spark. On Wed, Jan 31, 2018 at 3:39 PM, Yacine Mazari wrote: > Thanks for the quick reply and explanation @rxin. > > So if one
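A sketch of the distinction; `ds` is a hypothetical Dataset with `id` and `score` columns:

```scala
import org.apache.spark.sql.functions.desc

// limit() returns a Dataset, not a local Array like take()/head(), so the
// query stays lazy and an ordered limit can be planned efficiently: each
// partition keeps only its top k rows before the results are merged.
val topK = ds.orderBy(desc("score")).limit(100)

// Further transformations still go through the optimizer.
val result = topK.select("id", "score")
```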

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-05 Thread Wenchen Fan
I think many advanced Spark users already have custom Catalyst rules to deal with the query plan directly, so it makes a lot of sense to standardize the logical plan. However, instead of exploring possible operations ourselves, I think we should follow the SQL standard. ReplaceTable, RTAS:

Re: There is no space for new record

2018-02-09 Thread Wenchen Fan
This has been reported before: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-tc20108.html I think we may have a real bug here, but we need a reproduce. Can you provide one? thanks! On Fri, Feb 9, 2018 at 5:59 PM,

Re: There is no space for new record

2018-02-09 Thread Wenchen Fan
It should be fixed by https://github.com/apache/spark/pull/20561 soon. On Fri, Feb 9, 2018 at 6:16 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > This has been reported before: http://apache-spark- > developers-list.1001551.n3.nabble.com/java-lang- > IllegalStateException-Th

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Wenchen Fan
+1 On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin wrote: > +1 > > On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal , > wrote: > > this file shouldn't be included? https://dist.apache.org/repos/ >> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml >> >

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-22 Thread Wenchen Fan
+1 On Fri, Feb 23, 2018 at 6:23 AM, Sameer Agarwal wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC > and passes if a majority of at least 3 PMC +1 votes are cast. >

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Wenchen Fan
ng(Ryan) Zhu >>>> >> >> <shixi...@databricks.com> wrote: >>>> >> >>> >>>> >> >>> I'm -1 because of the UI regression >>>> >> >>> https://issues.apache.org/jira/browse/SPARK-23470 :

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Wenchen Fan
ARK-23470 : the All Jobs >>> page >>> >> >>> may be >>> >> >>> too slow and cause "read timeout" when there are lots of jobs and >>> >> >>> stages. >>> >> >>> This is one of the most i

Re: [DISCUSS] Multiple catalog support

2018-07-27 Thread Wenchen Fan
I think the major issue is, now users have 2 ways to create a specific data source table: 1) use the USING syntax. 2) create the table in the specific catalog. It can be super confusing if users create a cassandra table with the hbase data source. Also we can't drop the USING syntax, as data source v1

Re: [DISCUSS] Multiple catalog support

2018-08-01 Thread Wenchen Fan
he “feature” that you can write a different > schema to a path-based JSON table without needing to run an “alter table” > on it to update the schema. If this is behavior we want to preserve (and I > think it is) then we need to clearly state what that behavior is. > > Second, I think that we

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-30 Thread Wenchen Fan
Another two correctness bug fixes were merged to 2.3 today: https://issues.apache.org/jira/browse/SPARK-24934 https://issues.apache.org/jira/browse/SPARK-24957 On Mon, Jul 30, 2018 at 1:19 PM Xiao Li wrote: > Sounds good to me. Thanks! Today, we merged another correctness fix >

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Wenchen Fan
, I am close but need some more time. > We could get it into 2.4. > > Stavros > > On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan wrote: > >> This seems fine to me. >> >> BTW Ryan Blue and I are working on some data source v2 stuff and >> hopefully we can

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Wenchen Fan
I went through the open JIRA tickets and here is a list that we should consider for Spark 2.4: *High Priority*: SPARK-24374 : Support Barrier Execution Mode in Apache Spark This one is critical to the Spark ecosystem for deep learning. It only

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Wenchen Fan
Hi Carson and Yuanjian, Thanks for contributing to this project and sharing the production use cases! I believe the adaptive execution will be a very important feature of Spark SQL and will definitely benefit a lot of users. I went through the design docs and the high-level design totally makes

Re: Data source V2

2018-07-31 Thread Wenchen Fan
Hi assaf, Thanks for trying data source v2! Data source v2 is still evolving (we marked all the data source v2 interfaces as @Evolving), and we've already made a lot of API changes in this release (some renaming, switching to InternalRow, etc.). So I'd not encourage people to use data source v2 in

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Wenchen Fan
This seems fine to me. BTW Ryan Blue and I are working on some data source v2 stuff and hopefully we can get more things done with one more week. Thanks, Wenchen On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang wrote: > Xiangrui and I are leading an effort to implement a highly desirable >

Re: Writing file

2018-07-31 Thread Wenchen Fan
It depends on how you deploy Spark. The writer just writes data to your specified path (HDFS or a local path), but the writer runs on executors. If you deploy Spark in local mode, i.e. executor and driver are together, then you will see the output file on the driver node. If you deploy Spark

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Wenchen Fan
Here is my interpretation of your proposal, please correct me if something is wrong. End users can read/write a data source with its name and some options. e.g. `df.read.format("xyz").option(...).load`. This is currently the only end-user API for data source v2, and is widely used by Spark users

DISCUSS: SPARK-24882 data source v2 API improvement

2018-07-31 Thread Wenchen Fan
Hi all, Data source v2 is out for a while. During this release, we migrated most of the streaming sources to the v2 API (SPARK-22911 ) started to migrate file sources (SPARK-23817 ) started to

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Wenchen Fan
ocumentation of the issue so that users are less likely to stumble into >> this unaware; but really we need to fix at least the most common cases of >> this bug. Backports to maintenance branches are also probably in order. >> >> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid >> wr

Re: [DISCUSS] SparkR support on k8s back-end for Spark 2.4

2018-08-15 Thread Wenchen Fan
I'm also happy to see we have R support on k8s for Spark 2.4. I'll do the manual testing for it if we don't want to upgrade the OS now. If the Python support is also merged in this way, I think we can merge the R support PR too? On Thu, Aug 16, 2018 at 7:23 AM shane knapp wrote: > >> What is

Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Wenchen Fan
SPARK-25051 is resolved, can we start a new RC? SPARK-16406 is an improvement, generally we should not backport. On Wed, Aug 15, 2018 at 5:16 AM Sean Owen wrote: > (We wouldn't consider lack of an improvement to block a maintenance > release. It's reasonable to raise this elsewhere as a big

[SPARK-24771] Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Wenchen Fan
Hi all, We've upgraded Avro from 1.7 to 1.8, to support date/timestamp/decimal types in the newly added Avro data source in the coming Spark 2.4, and also to make Avro work with Parquet. Since Avro 1.8 is not binary compatible with Avro 1.7 (see https://issues.apache.org/jira/browse/AVRO-1502),

Re: Set up Scala 2.12 test build in Jenkins

2018-08-05 Thread Wenchen Fan
It seems to me that the closure cleaner fails to clean up something. The failed test case defines a serializable class inside the test case, and the class doesn't refer to anything in the outer class. Ideally it can be serialized after cleaning up the closure. This is somehow a very weird way to

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Wenchen Fan
Some updates for the JIRA tickets that we want to resolve before Spark 2.4. green: merged orange: in progress red: likely to miss SPARK-24374 : Support Barrier Execution Mode in Apache Spark The core functionality is finished, but we still need

Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-17 Thread Wenchen Fan
ved before we can start getting the API in? If so, what do you think > needs to be decided to get it ready? > > Thanks! > > rb > > On Wed, Jul 11, 2018 at 8:24 PM Wenchen Fan wrote: > >> Hi Ryan, >> >> Great job on this! Shall we call a vote for the plan stan

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread Wenchen Fan
+1 (binding). I think this is more clear to both users and developers, compared to the existing one which only supports append/overwrite and doesn't work with tables in data source(like JDBC table) well. On Wed, Jul 18, 2018 at 2:06 AM Ryan Blue wrote: > +1 (not binding) > > On Tue, Jul 17,

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-15 Thread Wenchen Fan
+1. The Spark 2.3 regressions I'm aware of are all fixed. On Sun, Jul 15, 2018 at 4:09 PM Saisai Shao wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.2. > > The vote is open until July 20 PST and passes if a majority +1 PMC votes > are cast, with a

Re: data source api v2 refactoring

2018-09-04 Thread Wenchen Fan
ase it was dropped. > > -- Forwarded message - > From: Wenchen Fan > Date: Mon, Sep 3, 2018 at 6:16 AM > Subject: Re: data source api v2 refactoring > To: > Cc: Ryan Blue , Reynold Xin , < > dev@spark.apache.org> > > > Hi Mridul, > >

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Wenchen Fan
+ Liang-Chi and Herman, I think this is a common requirement to get top N records. For now we guarantee it by the `TakeOrderedAndProject` operator. However, this operator may not be used if the spark.sql.execution.topKSortFallbackThreshold config has a small value. Shall we reconsider
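A sketch of the interaction described here, assuming the semantics stated in the thread (with a small threshold, a sort-then-limit query falls back to a global sort followed by a limit instead of `TakeOrderedAndProject`):

```scala
// Raising the threshold keeps ORDER BY ... LIMIT k on the top-k path;
// explain() shows whether TakeOrderedAndProject is in the physical plan.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10000")
spark.sql("SELECT * FROM t ORDER BY score DESC LIMIT 100").explain()
```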

Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Wenchen Fan
The repartition correctness bug fix is merged. The Scala 2.12 PRs mentioned in this thread are all merged. The Kryo upgrade is done. I'm going to cut the branch 2.4 since all the major blockers are now resolved. Thanks, Wenchen On Sun, Sep 2, 2018 at 12:07 AM sadhen wrote: >

Re: data source api v2 refactoring

2018-09-07 Thread Wenchen Fan
th what you propose, >> assuming that I understand it correctly. >> >> rb >> >> On Tue, Sep 4, 2018 at 8:42 PM Wenchen Fan wrote: >> >>> I'm switching to my another Gmail account, let's see if it still gets >>> dropped this time. >>>

Re: Branch 2.4 is cut

2018-09-07 Thread Wenchen Fan
prevent any >> regression. >> >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ >> >> Bests, >> Dongjoon. >> >> >> On Thu, Sep 6, 2018 at 6:56 AM Wenchen Fan wrote: >> >>> Good news! I'll try and upd

Re: Datasource v2 Select Into support

2018-09-06 Thread Wenchen Fan
Data source v2 catalog support(table/view) is still in progress. There are several threads in the dev list discussing it, please join the discussion if you are interested. Thanks for trying! On Thu, Sep 6, 2018 at 7:23 PM Ross Lawley wrote: > Hi, > > I hope this is the correct mailinglist. I've

Re: Branch 2.4 is cut

2018-09-06 Thread Wenchen Fan
> Let's try also producing a 2.12 build with this release. The machinery > should be there in the release scripts, but let me know if something fails > while running the release for 2.12. > > On Thu, Sep 6, 2018 at 12:32 AM Wenchen Fan wrote: > >> Hi all, >> >

Branch 2.4 is cut

2018-09-06 Thread Wenchen Fan
Hi all, I've cut the branch-2.4 since all the major blockers are resolved. If no objections I'll shortly followup with an RC to get the QA started in parallel. Committers, please only merge PRs to branch-2.4 that are bug fixes, performance regression fixes, document changes, or test suites

Re: Branch 2.4 is cut

2018-09-10 Thread Wenchen Fan
xpected to change further due redesigns before 3.0 so don't see much > value releasing it in 2.4. > > On Sun, 9 Sep 2018 at 22:42, Wenchen Fan wrote: > >> Strictly speaking, data source v2 is always half-finished until we mark >> it as stable. We need some small milestones to move

Re: Branch 2.4 is cut

2018-09-09 Thread Wenchen Fan
Strictly speaking, data source v2 is always half-finished until we mark it as stable. We need some small milestones to move forward step by step. The redesign also happens in an incremental way. SPARK-24882 mostly focuses on the "RDD" part of the API: the separation of reader factory and input
