Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not what you are looking for? https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression .setRegParam(0.0) .setElasticNetParam(0.0) On Thu, 11 Oct 2018 at 15:46, pikufolgado <
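
The suggestion in the reply above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the object name `UnpenalizedLogit`, the local master setting, and the toy data are assumptions made for the example. It only shows the two setters from the reply applied to Spark ML's `LogisticRegression`, so that no penalty term enters the fit.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Hypothetical driver object, for illustration only.
object UnpenalizedLogit {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy example.
    val spark = SparkSession.builder()
      .appName("UnpenalizedLogit")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up, non-separable toy data: (label, features).
    val training = Seq(
      (0.0, Vectors.dense(0.5)),
      (1.0, Vectors.dense(1.0)),
      (0.0, Vectors.dense(1.5)),
      (1.0, Vectors.dense(2.0)),
      (0.0, Vectors.dense(2.5)),
      (1.0, Vectors.dense(3.0))
    ).toDF("label", "features")

    // regParam = 0.0 switches the penalty off entirely;
    // elasticNetParam then has no effect, but is set as in the reply.
    val lr = new LogisticRegression()
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
      .setFamily("binomial")

    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}
```

With the penalty disabled, the fit is a plain maximum-likelihood logistic regression, which is what R's `glm` with a binomial family estimates; small numerical differences between the two optimizers are still possible.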

Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread pikufolgado
Hi, I would like to carry out a classic logistic regression analysis, in other words, without using penalised regression ("glmnet" in R). I have read the documentation and am not able to find this kind of model. Is it possible to estimate this? In R the name of the function is "glm". Best regar

[Spark Structured Streaming] Running out of disk quota due to /work/tmp

2018-10-11 Thread subramgr
We have a Spark Structured Streaming job which runs out of disk quota after some days. The primary reason is that a bunch of empty folders are getting created in the /work/tmp directory. Any idea how to prune them?

re: yarn resource overcommit: cpu / vcores

2018-10-11 Thread Peter Liu
Hi there, is there any best-practice guideline on YARN resource overcommit with CPU / vcores, such as YARN config options, candidate cases ideal for overcommitting vcores, etc.? This slide below (from 2016) seems to address the memory overcommit topic and hints at a "future" topic on CPU overcommit: ht

Re: Process Million Binary Files

2018-10-11 Thread Nicolas PARIS
Hi Joel, I built such a pipeline to transform PDF -> text: https://github.com/EDS-APHP/SparkPdfExtractor You can take a look. It transforms 20M PDFs in 2 hours on a 5-node Spark cluster. On 2018-10-10 23:56, Joel D wrote: > Hi, > > I need to process millions of PDFs in hdfs using spark. First I’m