Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So isn't LogisticRegression with regParam and elasticNetParam set to 0 what
you are looking for?

https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression

  .setRegParam(0.0)
  .setElasticNetParam(0.0)
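
For reference, a minimal end-to-end sketch (the training DataFrame, here called
"training" with "label" and "features" columns, is an assumption and not taken
from the docs page):

  import org.apache.spark.ml.classification.LogisticRegression

  // "training" is an assumed DataFrame with a "label" column and a "features"
  // vector column (e.g. built earlier with VectorAssembler).
  val lr = new LogisticRegression()
    .setRegParam(0.0)        // no penalty term: plain maximum-likelihood logistic regression
    .setElasticNetParam(0.0) // the L1/L2 mixing parameter has no effect once regParam is 0
    .setMaxIter(100)

  val model = lr.fit(training)
  println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")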


On Thu., 11 Oct. 2018 at 15:46, pikufolgado <pikufolg...@gmail.com> wrote:

> Hi,
>
> I would like to carry out a classic logistic regression analysis. In other
> words, without using penalised regression ("glmnet" in R). I have read the
> documentation and am not able to find this kind of model.
>
> Is it possible to estimate this? In R the name of the function is "glm".
>
> Best regards
>


Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread pikufolgado
Hi,

I would like to carry out a classic logistic regression analysis. In other
words, without using penalised regression ("glmnet" in R). I have read the
documentation and am not able to find this kind of model.

Is it possible to estimate this? In R the name of the function is "glm".

Best regards






[Spark Structured Streaming] Running out of disk quota due to /work/tmp

2018-10-11 Thread subramgr
We have a Spark Structured Streaming job which runs out of disk quota after
some days.

The primary reason is that a bunch of empty folders are getting created in
the /work/tmp directory.

Any idea how to prune them?






re: yarn resource overcommit: cpu / vcores

2018-10-11 Thread Peter Liu
Hi there,

Is there any best-practice guideline on YARN resource overcommit with CPU /
vcores, such as YARN config options, candidate cases ideal for
overcommitting vcores, etc.?

The slide deck below (from 2016) seems to address the memory overcommit topic
and hints at a "future" topic on CPU overcommit:
https://www.slideshare.net/HadoopSummit/investing-the-effects-of-overcommitting-yarn-resources

any help/hint would be very much appreciated!

Regards,

Peter

FYI:
I have a system with 80 vcores and a relatively light Spark Streaming
workload. Overcommitting the vcore resource (to 100 here) seems to help the
average Spark batch time; I would like a better understanding of this practice.

  Skylake (1 x 900K msg/sec)   total batch# (avg)   avg batch time in ms (avg)   avg user cpu (%)   nw read (mb/sec)
  70 vcores                    178.20               8154.69                      n/a                n/a
  80 vcores                    177.40               7865.44                      27.85              222.31
  100 vcores                   177.00               7209.37                      30.02              220.86


Re: Process Million Binary Files

2018-10-11 Thread Nicolas PARIS
Hi Joel

I built such a pipeline to transform PDF -> text:
https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look.

It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.

On 2018-10-10 23:56, Joel D wrote:
> Hi,
> 
> I need to process millions of PDFs in HDFS using Spark. First I’m
> trying with some 40k files. I’m using the binaryFiles API, with which
> I’m facing a couple of issues:
> 
> 1. It creates only 4 tasks and I can’t seem to increase the
> parallelism there.
> 2. It took 2276 seconds, which means that for millions of files it will
> take ages to complete. I’m also expecting it to fail for millions of
> records with some timeout or GC overhead exception.
> 
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()
> 
> val fileContentRdd = files.map(file => myFunc(file))
> 
> Do you have any guidance on how I can process millions of files using
> the binaryFiles API?
> 
> How can I increase the number of tasks/parallelism during the creation
> of the files RDD?
> 
> Thanks




Re: Process Million Binary Files

2018-10-11 Thread Jörn Franke
I believe your use case would be better covered by a custom data source that
reads PDF files.

On big data platforms in general you have the issue that individual PDF files
are very small and there are a lot of them - this is not very efficient for
those platforms. That could also be one source of your performance problems
(not necessarily the parallelism). You would need to make 1 million requests to
the namenode (which could also be interpreted as a denial-of-service attack).
Historically, Hadoop Archives were introduced to address this problem:
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html

You can also try to store the files first in HBase or, in the future, on Hadoop
Ozone. That could make higher parallelism possible "out of the box".
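
On the parallelism part of the question, one common workaround is to
repartition right after the read: in some Spark versions the minPartitions
argument of binaryFiles is effectively only a hint, and many small files still
end up packed into a handful of partitions. A minimal sketch, with an assumed
HDFS path and a trivial stand-in for the myFunc from the original post:

  import org.apache.spark.input.PortableDataStream
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("pdf-extract").getOrCreate()

  // Stand-in for the real PDF-to-text function from the original post;
  // here it just returns the size of each file.
  def myFunc(file: (String, PortableDataStream)): (String, Long) = {
    val (path, stream) = file
    (path, stream.toArray().length.toLong)
  }

  // RDD of (path, PortableDataStream) pairs; the path and the 200 are assumptions.
  val files = spark.sparkContext.binaryFiles("hdfs:///data/pdfs", 200)

  // Spread the expensive per-file work over many tasks before mapping;
  // 2000 is an assumed value, tune it to the cluster and the file count.
  val fileContentRdd = files
    .repartition(2000)
    .map(myFunc)

  println(fileContentRdd.count())

The repartition should shuffle only the lightweight (path, stream) handles
rather than the PDF bytes themselves, since PortableDataStream reads the
underlying file lazily on the executor that finally processes it.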

> On 10.10.2018 at 23:56, Joel D wrote:
> 
> Hi,
> 
> I need to process millions of PDFs in HDFS using Spark. First I’m trying with
> some 40k files. I’m using the binaryFiles API, with which I’m facing a couple
> of issues:
> 
> 1. It creates only 4 tasks and I can’t seem to increase the parallelism
> there.
> 2. It took 2276 seconds, which means that for millions of files it will take
> ages to complete. I’m also expecting it to fail for millions of records with
> some timeout or GC overhead exception.
> 
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache()
> 
> val fileContentRdd = files.map(file => myFunc(file))
> 
> Do you have any guidance on how I can process millions of files using
> the binaryFiles API?
> 
> How can I increase the number of tasks/parallelism during the creation of
> the files RDD?
> 
> Thanks
>