Data source V2 in spark 2.4.0

2018-09-27 Thread AssafMendelson
Hi all, I understood from previous threads that the Data Source V2 API will see some changes in Spark 2.4.0; however, I can't seem to find what those changes are. Is there some documentation that summarizes the changes? The only mention I seem to find is this pull request:

IndexOutOfBoundException in catalyst when doing multiple approxDistinctCount

2017-08-08 Thread AssafMendelson
Hi, I am doing a large number of aggregations on a dataframe (without groupBy) to get some statistics. As part of this I am doing an approx_count_distinct(c, 0.01). Everything works fine, but when I do the same aggregation a second time (for each column) I get the following error: [Stage 2:>
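
A minimal reconstruction of the pattern described, assuming a spark-shell session; df stands in for the dataframe in the message, and only the 0.01 relative error is taken from it:

    import org.apache.spark.sql.functions.{approx_count_distinct, col}

    // one approx-distinct aggregate per column, all in a single agg() call
    val aggs = df.columns.map(c => approx_count_distinct(col(c), 0.01))
    val firstPass  = df.agg(aggs.head, aggs.tail: _*).collect()  // works
    val secondPass = df.agg(aggs.head, aggs.tail: _*).collect()  // reportedly fails here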

spark higher order functions

2017-06-20 Thread AssafMendelson
Hi, I have seen that Databricks has higher-order functions (https://docs.databricks.com/_static/notebooks/higher-order-functions.html, https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html), which basically allow you to do generic
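
At the time of this message these functions were Databricks-only SQL extensions; Spark itself shipped transform, filter, and aggregate in 2.4. A minimal sketch of transform on Spark 2.4+, with made-up table and column names:

    import spark.implicits._

    val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("id", "values")
    df.createOrReplaceTempView("t")
    // apply a lambda to every element of the array column
    spark.sql("SELECT id, transform(values, v -> v + 1) AS plus_one FROM t").show()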

having trouble using structured streaming with file sink (parquet)

2017-06-14 Thread AssafMendelson
Hi all, I have recently started assessing structured streaming and ran into a little snag right from the beginning. Basically, I wanted to read some data, do some basic aggregation, and write the result to a file: import org.apache.spark.sql.functions.avg import
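
The snippet cuts off at the imports, but a classic snag with this combination (an assumption here, not stated in the message) is that the file sink supports only append mode, while an aggregation without a watermark requires complete or update mode. A sketch that makes append legal by adding a watermark; the schema, paths, and column names are made up:

    import org.apache.spark.sql.functions.{avg, col, window}
    import org.apache.spark.sql.types._

    val schema = new StructType().add("ts", TimestampType).add("value", DoubleType)
    val input = spark.readStream.schema(schema).json("/tmp/in")

    val result = input
      .withWatermark("ts", "10 minutes")            // lets Spark finalize old windows
      .groupBy(window(col("ts"), "5 minutes"))
      .agg(avg(col("value")))

    val query = result.writeStream
      .format("parquet")
      .option("path", "/tmp/out")
      .option("checkpointLocation", "/tmp/chk")
      .outputMode("append")                         // the only mode the file sink accepts
      .start()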

Adding metrics to spark datasource

2017-03-13 Thread AssafMendelson
Hi, I am building a data source so I can convert a custom source to a dataframe. I have been going over examples such as JDBC and noticed that JDBC does the following: val inputMetrics = context.taskMetrics().inputMetrics and whenever a new record is added: inputMetrics.incRecordsRead(1)
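
The incRecordsRead call quoted from the JDBC source is private[spark] in the 2.x code base, so a third-party data source cannot invoke it directly. One public-API alternative (a suggestion, not from the message) is a named accumulator, which also surfaces per stage in the Spark UI. A spark-shell sketch:

    // counts records as the (stand-in) source RDD produces them
    val recordsRead = spark.sparkContext.longAccumulator("custom source: records read")

    val rdd = spark.sparkContext
      .parallelize(1 to 1000)                 // stand-in for the custom source's rows
      .map { v => recordsRead.add(1); v }     // increment once per record

    rdd.count()
    println(s"records read: ${recordsRead.value}")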

building runnable distribution from source

2016-09-29 Thread AssafMendelson
Hi, I am trying to compile the latest branch of Spark in order to try out some code I want to contribute. I was looking at the build instructions at http://spark.apache.org/docs/latest/building-spark.html So at first I did: ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
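
For the runnable distribution named in the subject, the same page documents ./dev/make-distribution.sh, which takes the same Maven profiles. A sketch using the profiles from the message (flags as documented on that page at the time):

    # package a runnable distribution with the same profiles; --tgz also
    # produces a tarball (flags per building-spark.html)
    ./dev/make-distribution.sh --name custom-spark --tgz \
        -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0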

RE: How to make the result of sortByKey distributed evenly?

2016-09-06 Thread AssafMendelson
I imagine this is a simplified example meant to illustrate a bigger concern. In general, when you do a sort by key, it will implicitly shuffle the data by the key. Since one key (0) holds nearly all the records and the other just one record, it will simply shuffle them into two very skewed partitions. One way you can
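
The reply is cut off mid-sentence. A small spark-shell illustration of the skew it describes, plus one common mitigation (salting the key; an assumption on my part, not necessarily what the truncated sentence went on to say):

    // one key (0) carries ~1e6 records, the other key (1) carries one
    val rdd = sc.parallelize((1 to 1000000).map(i => (0, i)) ++ Seq((1, 0)))

    // sortByKey range-partitions on the key, so almost everything
    // lands in a single partition:
    val sorted = rdd.sortByKey(ascending = true, numPartitions = 2)
    sorted.mapPartitions(it => Iterator(it.size)).collect()  // e.g. Array(1000000, 1)

    // one common mitigation: add a salt so the heavy key spreads across
    // partitions, keeping order only within each (key, salt) group
    val salted = rdd.map { case (k, v) => ((k, v % 8), v) }
    val spread = salted.sortByKey(numPartitions = 8)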

Creating a UDF/UDAF using code generation

2016-09-04 Thread AssafMendelson
Hi, I want to write a UDF/UDAF that delivers native processing performance. Currently, a UDF/UDAF created in the normal manner takes a performance hit because it blocks Catalyst optimizations. I tried something like this: import org.apache.spark.sql.catalyst.InternalRow import
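
For reference, a minimal sketch of the direction described: a Catalyst expression with its own generated code, written against Spark 2.x internal classes (unstable, non-public APIs). The whole block is a reconstruction, not the author's code; runnable in a Spark 2.x spark-shell:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
    import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
    import org.apache.spark.sql.types.{DataType, LongType}

    // adds 1 to a long column; doGenCode splices "+ 1L" straight into the
    // generated Java, so whole-stage codegen keeps working
    case class PlusOne(child: Expression) extends UnaryExpression {
      override def dataType: DataType = LongType
      override protected def nullSafeEval(input: Any): Any =
        input.asInstanceOf[Long] + 1L
      override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
        defineCodeGen(ctx, ev, c => s"($c) + 1L")
    }

    spark.range(5).select(new Column(PlusOne(col("id").expr))).show()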

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
Assaf From: ayan guha [mailto:guha.a...@gmail.com] Sent: Sunday, September 04, 2016 11:00 AM To: Mendelson, Assaf Cc: user Subject: Re: Scala Vs Python Hi, this one is quite interesting. Is it possible to share a few toy examples? On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
t. You might be wondering why Spark would have it then? Well, probably because of its ease of use for ML (that would be my best guess). On Wed, Aug 31, 2016 11:45 PM, AssafMendelson assaf.mendel...@rsa.com

using multiple worker instances in spark standalone

2016-09-01 Thread AssafMendelson
Hi, I have a machine with lots of memory. Since, as I understand it, all executors in a single worker run in the same JVM, I do not want to use just one worker for all of that memory. Instead, I want to define multiple workers, each with less than 30GB of memory. Looking at the documentation, I see this would
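
For reference, the standalone-mode knobs the documentation describes for this, set in conf/spark-env.sh on each machine (variable names from the standalone docs; the values are made up):

    # conf/spark-env.sh -- illustrative values only
    SPARK_WORKER_INSTANCES=4   # run four workers on this machine
    SPARK_WORKER_MEMORY=30g    # memory each worker can hand out to executors
    SPARK_WORKER_CORES=8       # cores each worker can hand out to executors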

RE: Scala Vs Python

2016-09-01 Thread AssafMendelson
I believe this would greatly depend on your use case and your familiarity with the languages. In general, Scala will perform much better than Python, and not all interfaces are available in Python. That said, if you are planning to use dataframes without any UDF, then the performance
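
To make the dataframes-without-UDF point concrete: built-in functions compile to the same Catalyst plan whichever language submits them, while a UDF is opaque to the optimizer and, in PySpark, additionally crosses into a separate Python worker process. A Scala sketch with a made-up df and column:

    import org.apache.spark.sql.functions.{avg, col, udf}

    // built-ins: same optimized plan whether written in Scala or Python
    val builtIn = df.filter(col("v") > 0).agg(avg(col("v")))

    // a UDF blocks those optimizations; in PySpark it additionally pays
    // per-row serialization to and from a Python worker
    val plusOne = udf((x: Double) => x + 1)
    val viaUdf = df.select(plusOne(col("v")))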

broadcast fails on join

2016-08-30 Thread AssafMendelson
Hi, I am seeing a broadcast failure when doing a join, as follows: assume I have a dataframe df with ~80 million records. I do: df2 = df.filter(cond) # reduces to ~50 million records grouped = broadcast(df.groupby(df2.colA).count()) total = df2.join(grouped, df2.colA == grouped.colA, "inner")
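
Restated as a self-contained Scala sketch of the same shape (the original is PySpark and groups df by df2.colA; df2 is grouped here, and cond is made up):

    import org.apache.spark.sql.functions.broadcast

    val df2 = df.filter(cond)                      // ~50 million rows in the message
    val grouped = df2.groupBy("colA").count()      // one row per distinct colA
    // broadcast() collects the whole grouped table to the driver and ships a
    // copy to every executor; many distinct keys can exhaust memory here
    val total = df2.join(broadcast(grouped), Seq("colA"))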

UDF/UDAF performance

2016-08-28 Thread AssafMendelson
I am trying to do high-performance calculations that require custom functions. As a first stage, I am trying to profile the effect of using a UDF, and I am getting weird results. I created a simple test (in
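
The test itself is cut off; a hypothetical reconstruction of such a micro-benchmark in spark-shell, comparing built-in column arithmetic with the same arithmetic in a UDF:

    import org.apache.spark.sql.functions.{col, sum, udf}

    def time[T](label: String)(f: => T): T = {
      val start = System.nanoTime()
      val result = f
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
      result
    }

    val df = spark.range(10000000L).toDF("x").cache()
    df.count()                                   // materialize the cache first

    val plusOne = udf((x: Long) => x + 1)
    time("built-in") { df.agg(sum(col("x") + 1)).collect() }
    time("udf")      { df.agg(sum(plusOne(col("x")))).collect() }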