Re: Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread t4
https://issues.apache.org/jira/browse/SPARK-23576 ?






Re: [MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Matei Zaharia
Hi Steffen,

Thanks for sharing your results about MLlib — this sounds like a useful tool. 
However, I wanted to point out that some of the results may be expected for 
certain machine learning algorithms, so it might be good to design those tests 
with that in mind. For example:

> - The classifications of LogisticRegression, DecisionTree, and RandomForest
> were not inverted when all binary class labels were flipped.
> - The classifications of LogisticRegression, DecisionTree, GBT, and
> RandomForest sometimes changed when the features were reordered.
> - The classifications of LogisticRegression, RandomForest, and LinearSVC
> sometimes changed when the instances were reordered.

All of these things might occur because the algorithms are nondeterministic. 
Were the effects large or small? Or, for example, was the final difference in 
accuracy statistically significant? Many ML algorithms are trained using 
randomized algorithms like stochastic gradient descent, so you can’t expect 
exactly the same results under these changes.
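For instance, for estimators that expose a random seed, fixing it is one
practical way to separate run-to-run randomness from a genuine behavioral
change (a rough sketch; whether this removes all nondeterminism depends on the
algorithm):

import org.apache.spark.ml.classification.RandomForestClassifier

// With a fixed seed, repeated fits on the same data become comparable, so a
// metamorphic test can attribute remaining prediction changes to the data
// modification itself rather than to random initialization.
val rf = new RandomForestClassifier()
  .setNumTrees(50)
  .setSeed(42L)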

> - The classifications of NaïveBayes and LinearSVC sometimes changed when one
> was added to each feature value.

This might be due to nondeterminism as above, but it might also be due to
regularization or nonlinear effects for some algorithms. For example, some
algorithms might look at the relative values of features, in which case adding
1 to each feature value genuinely changes the data they see. Other algorithms
might require that data be centered around a mean of 0 to work best.
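If the tests want to control for that, the features could be standardized
first, for example with something like the following sketch (for dense feature
vectors; the column names and the training DataFrame are illustrative):

import org.apache.spark.ml.feature.StandardScaler

// Center and scale the features to zero mean and unit variance before
// training; setWithMean(true) requires dense vectors.
val scaler = new StandardScaler()
  .setWithMean(true)
  .setWithStd(true)
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
// then: scaler.fit(trainingDF).transform(trainingDF)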

I haven’t read the paper in detail, but basically it would be good to account
for randomized algorithms as well as various model assumptions, and to make
sure the differences in results reported by these tests are statistically
significant.

Matei





Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread Long, Andrew
Hello Friends,

I’ve encountered a bug where Spark silently corrupts data when reading from a
Parquet Hive table whose table schema does not match the file schema. I’d like
to take a shot at adding some extra validation to the code to handle this
corner case, and I was wondering if anyone had any suggestions for where to
start looking in the Spark code.
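For anyone who wants to try something similar, here is a rough sketch of the
kind of mismatch I mean (the path, table name, and exact type mismatch are
illustrative rather than the precise case I hit, and it needs a SparkSession
with Hive support; behavior may also differ across Spark versions and reader
paths):

// 1. Write a Parquet file whose physical schema is (a: INT, b: STRING).
spark.range(10)
  .selectExpr("cast(id as int) as a", "concat('v', cast(id as string)) as b")
  .write.mode("overwrite").parquet("/tmp/schema_mismatch_demo")

// 2. Declare a Hive table over the same files with a schema that does not
//    match the file schema (here, a different type for column `a`).
spark.sql("""
  CREATE EXTERNAL TABLE schema_mismatch_demo (a BIGINT, b STRING)
  STORED AS PARQUET
  LOCATION '/tmp/schema_mismatch_demo'
""")

// 3. Reading through the metastore: depending on the version and reader path,
//    this can return wrong values silently instead of failing with a clear
//    schema-validation error.
spark.table("schema_mismatch_demo").show()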

Cheers,
Andrew


Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Ankur Gupta
Thanks for your responses Saisai and Marco.

I agree that "rename" operation can be time-consuming on object storage,
which can potentially delay the shutdown.

I also agree that customers/users have a way to use log appenders to write
log files and then send them along with the Yarn application logs, but I still
think it is a cumbersome process. There is also the issue that customers
cannot easily identify which logs belong to which application without reading
the log file. And if users run multiple applications with the default log4j
configuration on the same host, they can end up writing to the same log file.

Because of the issues mentioned above, we could treat this as an optional
feature that is disabled by default but can be turned on by customers. This
would solve those problems and reduce the overhead on users/customers, while
adding a bit of overhead during the shutdown phase of the Spark application.
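To make the idea a bit more concrete, here is a rough sketch of what such an
optional listener could look like (the class name, the config key, and the
hand-off at the end are illustrative; a real implementation would need error
handling and the actual move to Yarn's remote application log dir):

import org.apache.log4j.{FileAppender, Logger, PatternLayout}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart}

// Sketch: attach a per-application file appender to the root logger on
// application start and detach it on application end, at which point the
// file could be moved to Yarn's remote application log directory.
class DriverLogListener(conf: SparkConf) extends SparkListener {
  // "spark.driver.localLogDir" is a made-up config key for this sketch.
  private val localDir = conf.get("spark.driver.localLogDir", "/tmp")
  private var appender: FileAppender = _

  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    val appId = event.appId.getOrElse("unknown-app")
    appender = new FileAppender(
      new PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"),
      s"$localDir/driver-$appId.log")
    Logger.getRootLogger.addAppender(appender)
  }

  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    Logger.getRootLogger.removeAppender(appender)
    appender.close()
    // Move/upload the file to Yarn's remote application dir here.
  }
}

Users could opt in by adding such a class to spark.extraListeners. One caveat:
anything logged before SparkContext initialization would still be missed, and
the final move adds to shutdown time, as Saisai pointed out.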

Thanks,
Ankur

On Wed, Aug 22, 2018 at 1:36 AM Marco Gaido  wrote:

> I agree with Saisai. You can also configure log4j to append anywhere other
> than the console. Many companies have their own systems for collecting and
> monitoring logs, and they just customize the log4j configuration. I am not
> sure how needed this change would be.
>
> Thanks,
> Marco
>
> On Wed, Aug 22, 2018 at 04:31 Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> One issue I can think of is that this "moving the driver log" at
>> application end is quite time-consuming, which will significantly delay the
>> shutdown. We have already suffered from this kind of "rename" problem for
>> the event log on object stores, and moving the driver log will make the
>> problem worse.
>>
>> For a vanilla Spark on Yarn client application, I think the user could
>> redirect the console output to a log file and provide both the driver log
>> and the Yarn application log to the customers; this seems like not a big
>> overhead.
>>
>> Just my two cents.
>>
>> Thanks
>> Saisai
>>
On Wed, Aug 22, 2018 at 5:19 AM Ankur Gupta  wrote:
>>
>>> Hi all,
>>>
>>> I want to highlight a problem that we face here at Cloudera and start a
>>> discussion on how to go about solving it.
>>>
>>> *Problem Statement:*
>>> Our customers reach out to us when they face problems in their Spark
>>> applications. Those problems can be related to Spark, environment issues,
>>> their own code, or something else altogether. A lot of the time these
>>> customers run their Spark applications in Yarn client mode, which, as we
>>> all know, uses a ConsoleAppender to print logs to the console. These
>>> customers usually send their Yarn logs to us for troubleshooting. As you
>>> may have figured, those logs do not contain the driver logs, which makes
>>> it difficult for us to troubleshoot the issue. In that scenario our
>>> customers end up running the application again, piping the output to a log
>>> file or using a local log appender, and then sending over that file.
>>>
>>> I believe that there are other users in the community who face a similar
>>> problem, where the central team managing Spark clusters has difficulty
>>> helping the end users because they ran their applications in the shell or
>>> in yarn client mode (I am not sure what the equivalent is in Mesos).
>>>
>>> Additionally, there may be teams who want to capture all these logs so
>>> that they can be analyzed at some later point in time, and the fact that
>>> driver logs are not part of the Yarn logs causes them to capture only
>>> partial logs or makes it difficult to capture all the logs.
>>>
>>> *Proposed Solution:*
>>> One "low touch" approach will be to create an ApplicationListener which
>>> listens for Application Start and Application End events. On Application
>>> Start, this listener will append a Log Appender which writes to a local or
>>> remote (eg:hdfs) log file in an application specific directory and moves
>>> this to Yarn's Remote Application Dir (or equivalent Mesos Dir) on
>>> application end. This way the logs will be available as part of Yarn Logs.
>>>
>>> I am also interested in hearing about other ideas that the community may
>>> have about this. Or if someone has already solved this problem, then I
>>> would like them to contribute their solution to the community.
>>>
>>> Thanks,
>>> Ankur
>>>
>>


Spark github sync works now

2018-08-22 Thread Xiao Li
FYI, the Spark GitHub sync was 10 hours behind this morning. You might get
failed merges because of this. I just triggered a re-sync; it should work now.

Thanks,

Xiao


Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Maciej Szymkiewicz
Given the popularity of related SO questions:


   - https://stackoverflow.com/q/41670103/1560062
   - https://stackoverflow.com/q/42465568/1560062
   - https://stackoverflow.com/q/41670103/1560062

it is probably more "nobody thought about asking",  than "it is not used
often".

On Wed, 22 Aug 2018 at 00:07, Reynold Xin  wrote:

> Probably just because it is not used that often and nobody has submitted a
> patch for it. I've used pivot probably on average once a week (primarily in
> spreadsheets), but I've never used unpivot ...
>
>
> On Tue, Aug 21, 2018 at 3:06 PM Ivan Gozali  wrote:
>
>> Hi there,
>>
>> I was looking into why the UNPIVOT feature isn't implemented, given that
>> Spark already has PIVOT implemented natively in the DataFrame/Dataset API.
>>
>> Came across this JIRA  
>> which
>> talks about implementing PIVOT in Spark 1.6, but no mention whatsoever
>> regarding UNPIVOT, even though the JIRA curiously references a blog post
>> that talks about both PIVOT and UNPIVOT :)
>>
>> Is this because UNPIVOT is simply generating multiple slim tables by
>> selecting each column and making a union out of all of them?
>>
>> Thank you!
>>
>> --
>> Regards,
>>
>>
>> Ivan Gozali
>> Lecida
>> Email: i...@lecida.com
>>
>


Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Mike Hynes
Hi Reynold/Ivan,

People familiar with pandas and R dataframes will likely have used the
dataframe "melt" idiom, which is the functionality I believe you are
referring to:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html

I have had to write this function myself in my own work in Spark SQL, as it
is a common step in data wrangling when you do not control the structure of
the input dataframes you are working with in your pipelines.
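Roughly, such a helper can be expressed with explode over an array of structs,
e.g. something like the following sketch (assuming the melted columns share a
compatible type; the helper and argument names are just illustrative):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// pandas-style melt: keep idVars as-is and unpivot valueVars into
// (variable, value) rows.
def melt(df: DataFrame, idVars: Seq[String], valueVars: Seq[String],
         varName: String = "variable", valueName: String = "value"): DataFrame = {
  // One struct per unpivoted column: (column name, column value).
  val kvs = explode(array(valueVars.map { c =>
    struct(lit(c).alias(varName), col(c).alias(valueName))
  }: _*))
  val ids: Seq[Column] = idVars.map(col)
  df.select((ids :+ kvs.alias("_kv")): _*)
    .select((ids :+ col(s"_kv.$varName") :+ col(s"_kv.$valueName")): _*)
}

For example, melt(df, Seq("id"), Seq("x", "y", "z")) yields the columns
(id, variable, value).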

I would hence second Ivan that adding it as a native dataframe method would
no doubt be helpful (and for what it's worth, so would other concepts from
the pandas API, such as named indexing & multilevel indexing).

Cheers,
Mike




On Tue, Aug 21, 2018, 5:07 PM Reynold Xin,  wrote:

> Probably just because it is not used that often and nobody has submitted a
> patch for it. I've used pivot probably on average once a week (primarily in
> spreadsheets), but I've never used unpivot ...
>
>
> On Tue, Aug 21, 2018 at 3:06 PM Ivan Gozali  wrote:
>
>> Hi there,
>>
>> I was looking into why the UNPIVOT feature isn't implemented, given that
>> Spark already has PIVOT implemented natively in the DataFrame/Dataset API.
>>
>> Came across this JIRA  
>> which
>> talks about implementing PIVOT in Spark 1.6, but no mention whatsoever
>> regarding UNPIVOT, even though the JIRA curiously references a blog post
>> that talks about both PIVOT and UNPIVOT :)
>>
>> Is this because UNPIVOT is simply generating multiple slim tables by
>> selecting each column and making a union out of all of them?
>>
>> Thank you!
>>
>> --
>> Regards,
>>
>>
>> Ivan Gozali
>> Lecida
>> Email: i...@lecida.com
>>
>


[MLlib][Test] Smoke and Metamorphic Testing of MLlib

2018-08-22 Thread Steffen Herbold

Dear developers,

I am writing to you because I applied an approach for the automated testing
of classification algorithms to Spark MLlib and would like to forward 
the results to you.


The approach is a combination of smoke testing and metamorphic testing. 
The smoke tests try to find problems by executing the training and 
prediction functions of classifiers with different data. These smoke 
tests should ensure the basic functioning of classifiers. I defined 20 
different data sets, some very simple (uniform features in [0,1]), some 
with extreme distributions, e.g., data close to machine precision. The 
metamorphic tests determine whether classification results change as expected
when the training data is modified, e.g., by reordering features, flipping
class labels, or reordering instances.
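For illustration, one such metamorphic relation written out by hand for Spark
looks roughly like this (a simplified sketch with made-up toy data, not one of
the generated JUnit tests):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Metamorphic relation: flipping all binary labels should flip the predictions.
val original = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.1, 1.2)),
  (1.0, Vectors.dense(1.5, 0.3)),
  (0.0, Vectors.dense(0.2, 0.9)),
  (1.0, Vectors.dense(1.1, 0.4))
)).toDF("label", "features")

val flipped = original.selectExpr("1.0 - label AS label", "features")

val lr = new LogisticRegression()
val predOriginal = lr.fit(original).transform(original).select("prediction")
val predFlipped = lr.fit(flipped).transform(original).select("prediction")
// Expected: each prediction in predFlipped equals 1 - the corresponding
// prediction in predOriginal.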


I generated 70 different JUnit tests for six different Spark ML 
classifiers. In summary, I found the following potential problems:
- One error due to a value being out of bounds for the LogisticRegression
classifier when data approaches MAXDOUBLE. Which bound is affected is not
explained.
- The classifications of NaïveBayes and LinearSVC sometimes changed when one
was added to each feature value.
- The classifications of LogisticRegression, DecisionTree, and RandomForest
were not inverted when all binary class labels were flipped.
- The classifications of LogisticRegression, DecisionTree, GBT, and
RandomForest sometimes changed when the features were reordered.
- The classifications of LogisticRegression, RandomForest, and LinearSVC
sometimes changed when the instances were reordered.


You can find details of our results online [1]. The provided resources
include the current draft of the paper that describes the tests, as well as
the detailed results. Moreover, we provide an executable test suite with all
tests we executed, as well as an export of our test results as an XML file
that contains all details of the test execution, including stack traces in
case of exceptions. The preprint and online materials also contain the
results for two other machine learning libraries, i.e., Weka and scikit-learn.
Additionally, you can find the atoml tool used to generate the tests on
GitHub [2].


I hope that these tests may help with the future development of Spark 
MLlib. You could help me a lot by answering the following questions:

- Do you consider the tests helpful?
- Are you considering any source code or documentation changes due to our
findings?
- Would you be interested in a pull request or any other type of 
integration of (a subset of) the tests into your project?
- Would you be interested in more such tests, e.g., covering hyperparameters,
other algorithm types like clustering, or more complex algorithm-specific
metamorphic tests?


I am looking forward to your feedback.

Best regards,
Steffen Herbold

[1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/
[2] https://github.com/sherbold/atoml

--
Dr. Steffen Herbold
Institute of Computer Science
University of Goettingen
Goldschmidtstraße 7
37077 Göttingen, Germany
mailto. herb...@cs.uni-goettingen.de
tel. +49 551 39-172037





Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-22 Thread makatun
Manu,
thank you very much for your response. 

1. Your post helps to further optimize Spark jobs for wide data
(https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015).
The suggested change of code:

// Build one projection over all columns at once instead of chaining a
// .withColumn call per column.
df.select(df.columns.map { col =>
  df(col).isNotNull
}: _*)

provides much better performance compared to the previous approach (where we
use the .withColumn method and loop over the initial columns). The difference
becomes astonishing when using the current Spark master (2.4.0-SNAPSHOT).
Please see the results of our measurements in the plot below.

[plot: job duration vs. number of columns]
The combination of the recent improvement in the Catalyst optimizer and the
more efficient code makes a game-changing difference: the job duration becomes
linear in the number of columns. The test jobs are able to process 40K columns
in less than 20 seconds. In contrast, before the optimizations (see the code
from the previous posts) the jobs were not able to process more than 1600
columns (and that already took minutes).
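For contrast, the slower pattern (looping .withColumn over the initial
columns) looks roughly like this (a sketch of the approach, not the exact code
from the previous posts):

// Each .withColumn call wraps the plan in another projection, so the analyzed
// plan grows with every column and the optimizer cost explodes.
val slow = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, acc(c).isNotNull)
}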

2.  CACHING

Manu Zhang wrote:
>> For 2, I guess `cache` will break up the logical plan and force it to be
>> analyzed.

According to this explanation, caching does not break the logical plan:

https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md#cache-and-checkpoint

The structure of our testing jobs with caching (see our previous posts with
the code and results) is very basic:
.read csv -> .cache -> .count
compared to:
.read csv -> .count
Adding caching increases the job duration significantly. This is especially
critical in Spark 2.4.0-SNAPSHOT: there, the job duration is linear in the
number of columns without caching, but becomes polynomial when caching is
added. The CSV files used for testing are approx. 2 MB, so it should not be a
problem to accommodate them in memory. As far as we understand, this is not
the expected behavior of caching.
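In code, the two job shapes are simply the following (a sketch; the file path
and CSV options are illustrative):

// Baseline: read csv -> count
val df = spark.read.option("header", "true").csv("/data/wide_table.csv")
df.count()

// With caching: read csv -> cache -> count
val cached = spark.read.option("header", "true").csv("/data/wide_table.csv").cache()
cached.count()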

3. 

antonkulaga wrote:
>> did you try to test something more complex, like dataframe.describe or
>> PCA?

Anton Kulaga,
we use dataframe.describe mostly for debugging purposes. Its execution takes
additional time, but we did not perform measurements because, typically, it is
not included in production jobs.
We also did not test PCA transformations. It would be very interesting if you
could share your observations/measurements for those.

CONCLUSION:
- Using .withColumn has a high cost in the Catalyst optimizer. The alternative
approach, using .select with a mapping over the columns, reduces job duration
dramatically and enables processing tables with tens of thousands of columns.
- It would be interesting to further investigate how the complexity of caching
is influenced by the number of columns.






Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-22 Thread Marco Gaido
I agree with Saisai. You can also configure log4j to append anywhere other
than the console. Many companies have their own systems for collecting and
monitoring logs, and they just customize the log4j configuration. I am not
sure how needed this change would be.

Thanks,
Marco

On Wed, Aug 22, 2018 at 04:31 Saisai Shao  wrote:

> One issue I can think of is that this "moving the driver log" at
> application end is quite time-consuming, which will significantly delay the
> shutdown. We have already suffered from this kind of "rename" problem for
> the event log on object stores, and moving the driver log will make the
> problem worse.
>
> For a vanilla Spark on Yarn client application, I think the user could
> redirect the console output to a log file and provide both the driver log
> and the Yarn application log to the customers; this seems like not a big
> overhead.
>
> Just my two cents.
>
> Thanks
> Saisai
>
> On Wed, Aug 22, 2018 at 5:19 AM Ankur Gupta  wrote:
>
>> Hi all,
>>
>> I want to highlight a problem that we face here at Cloudera and start a
>> discussion on how to go about solving it.
>>
>> *Problem Statement:*
>> Our customers reach out to us when they face problems in their Spark
>> applications. Those problems can be related to Spark, environment issues,
>> their own code, or something else altogether. A lot of the time these
>> customers run their Spark applications in Yarn client mode, which, as we
>> all know, uses a ConsoleAppender to print logs to the console. These
>> customers usually send their Yarn logs to us for troubleshooting. As you
>> may have figured, those logs do not contain the driver logs, which makes
>> it difficult for us to troubleshoot the issue. In that scenario our
>> customers end up running the application again, piping the output to a log
>> file or using a local log appender, and then sending over that file.
>>
>> I believe that there are other users in the community who face a similar
>> problem, where the central team managing Spark clusters has difficulty
>> helping the end users because they ran their applications in the shell or
>> in yarn client mode (I am not sure what the equivalent is in Mesos).
>>
>> Additionally, there may be teams who want to capture all these logs so
>> that they can be analyzed at some later point in time, and the fact that
>> driver logs are not part of the Yarn logs causes them to capture only
>> partial logs or makes it difficult to capture all the logs.
>>
>> *Proposed Solution:*
>> One "low touch" approach will be to create an ApplicationListener which
>> listens for Application Start and Application End events. On Application
>> Start, this listener will append a Log Appender which writes to a local or
>> remote (eg:hdfs) log file in an application specific directory and moves
>> this to Yarn's Remote Application Dir (or equivalent Mesos Dir) on
>> application end. This way the logs will be available as part of Yarn Logs.
>>
>> I am also interested in hearing about other ideas that the community may
>> have about this. Or if someone has already solved this problem, then I
>> would like them to contribute their solution to the community.
>>
>> Thanks,
>> Ankur
>>
>