Re: Classic logistic regression missing !!! (Generalized linear models)
So the LogisticRegression with regParam and elasticNetParam set to 0 is not what you are looking for? https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression

  .setRegParam(0.0)
  .setElasticNetParam(0.0)

On Thu, 11 Oct 2018 at 15:46, pikufolgado <pikufolg...@gmail.com> wrote:

> Hi,
>
> I would like to carry out a classic logistic regression analysis. In
> other words, without using penalised regression ("glmnet" in R). I have
> read the documentation and am not able to find this kind of model.
>
> Is it possible to estimate this? In R the name of the function is "glm".
>
> Best regards
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
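The reply's suggestion can be sketched as follows. This is a minimal sketch, not a tested job: `training` is an assumed DataFrame with the usual "label"/"features" columns. GeneralizedLinearRegression is included as the closer analogue of R's glm, since its summary also reports standard errors and p-values:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Unpenalised ("classic") logistic regression: regParam = 0 means the
// loss function carries no regularisation term at all.
val lr = new LogisticRegression()
  .setRegParam(0.0)        // no penalty
  .setElasticNetParam(0.0) // L1/L2 mix; irrelevant once regParam is 0
val lrModel = lr.fit(training) // `training` is an assumed DataFrame

// Closer to R's glm(family = binomial): Spark's GLM implementation,
// whose training summary exposes coefficient standard errors and p-values.
val glm = new GeneralizedLinearRegression()
  .setFamily("binomial")
  .setLink("logit")
val glmModel = glm.fit(training)
println(glmModel.summary)
```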
Classic logistic regression missing !!! (Generalized linear models)
Hi,

I would like to carry out a classic logistic regression analysis. In other words, without using penalised regression ("glmnet" in R). I have read the documentation and am not able to find this kind of model.

Is it possible to estimate this? In R the name of the function is "glm".

Best regards
[Spark Structured Streaming] Running out of disk quota due to /work/tmp
We have a Spark Structured Streaming job which runs out of disk quota after some days. The primary reason is that a bunch of empty folders keep getting created in the /work/tmp directory. Any idea how to prune them?
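One possible workaround, pending a proper fix in the job itself, is a periodic cleanup of empty directories. This is a hedged sketch: the /work/tmp path and the one-day age cutoff are assumptions taken from the mail, and you should confirm the directory really is the job's scratch space before enabling `-delete`:

```shell
# Prune empty leftover directories under the streaming job's temp area.
# TMP_DIR and the age threshold are assumptions -- adjust for your setup.
TMP_DIR="${TMP_DIR:-/work/tmp}"

prune_empty_dirs() {
  # -mindepth 1   : never remove the root directory itself
  # -type d -empty: only directories that contain nothing
  # -mtime +1     : untouched for more than a day, so unlikely to be in use
  find "$1" -mindepth 1 -type d -empty -mtime +1 -delete
}

# Run against the scratch area (e.g. from cron) if it exists.
[ -d "$TMP_DIR" ] && prune_empty_dirs "$TMP_DIR" || true
```

Because `-mtime +1` skips anything modified recently, directories the running job is still touching are left alone.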
re: yarn resource overcommit: cpu / vcores
Hi there,

Is there any best-practice guideline on YARN resource overcommit with CPU / vcores, such as YARN config options, candidate cases ideal for overcommitting vcores, etc.? The slide deck below (from 2016) seems to address the memory overcommit topic and hints at a "future" topic on CPU overcommit: https://www.slideshare.net/HadoopSummit/investing-the-effects-of-overcommitting-yarn-resources

Any help/hint would be very much appreciated!

Regards, Peter

FYI: I have a system with 80 vcores and a relatively light Spark Streaming workload. Overcommitting the vcore resource (here to 100) seems to help the average Spark batch time. I need more understanding of this practice.

Skylake (1 x 900K msg/sec):

  vcores   total batch# (avg)   avg batch time in ms (avg)   avg user cpu (%)   nw read (MB/sec)
  70       178.20               8154.69                      n/a                n/a
  80       177.40               7865.44                      27.85              222.31
  100      177.00               7209.37                      30.02              220.86
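For reference, the knob behind the experiment above is the NodeManager's advertised vcore count. A minimal sketch of the setting, assuming the 80-physical-core / 100-vcore configuration described in the mail (the value reflects that experiment, not a general recommendation):

```xml
<!-- yarn-site.xml on each NodeManager -->
<property>
  <!-- Number of vcores this node advertises to the ResourceManager.
       Setting it above the physical core count (here 80) overcommits CPU:
       more containers are scheduled than there are cores. -->
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>100</value>
</property>
```

Whether overcommit helps depends on how CPU-bound the containers are; a light streaming workload with idle waits (as described above) is the typical candidate.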
Re: Process Million Binary Files
Hi Joel,

I built such a pipeline to transform PDF -> text: https://github.com/EDS-APHP/SparkPdfExtractor You can take a look. It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.

On 2018-10-10 23:56, Joel D wrote:

> Hi,
>
> I need to process millions of PDFs in HDFS using Spark. First I’m trying
> with some 40k files. I’m using the binaryFiles API, with which I’m facing
> a couple of issues:
>
> 1. It creates only 4 tasks and I can’t seem to increase the parallelism
>    there.
> 2. It took 2276 seconds, which means that for millions of files it will
>    take ages to complete. I’m also expecting it to fail for millions of
>    records with some timeout or GC overhead exception.
>
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache
>
> val fileContentRdd = files.map(file => myFunc(file))
>
> Do you have any guidance on how I can process millions of files using
> the binaryFiles API?
>
> How can I increase the number of tasks/parallelism during the creation
> of the files RDD?
>
> Thanks
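On the parallelism question specifically: the second argument to binaryFiles is only a minimum-partitions hint, which Spark may ignore when the input occupies few splits. A hedged sketch of one common workaround, reusing the poster's own names (`sparkSession`, `filePath`, `myFunc` are theirs, not defined here):

```scala
// binaryFiles' second argument is only a *hint*; with small inputs it can
// still produce a handful of tasks (the 4 observed above).
val files = sparkSession.sparkContext.binaryFiles(filePath, 200)

// Explicitly shuffle into 200 partitions BEFORE the heavy per-file work,
// so myFunc runs with real parallelism across the cluster. Cache the
// (presumably smaller) transformed output rather than the raw bytes.
val fileContentRdd = files
  .repartition(200)
  .map(file => myFunc(file))
  .cache()
```

The repartition adds a shuffle of the raw file bytes, so it only pays off when myFunc dominates the runtime, as a PDF-to-text transform typically does.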
Re: Process Million Binary Files
I believe your use case can be better covered with your own data source reading PDF files. On Big Data platforms in general you have the issue that individual PDF files are very small and there are a lot of them; this is not very efficient for those platforms. That could also be one source of your performance problems (not necessarily the parallelism): you would need to make 1 million requests to the NameNode (this could also be interpreted as a denial-of-service attack).

Historically, Hadoop Archives were introduced to address this problem: https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html

You can also try to store them first in HBase or, in the future, on Hadoop Ozone. That could make a higher parallelism possible "out of the box".

> On 10.10.2018 at 23:56, Joel D wrote:
>
> Hi,
>
> I need to process millions of PDFs in HDFS using Spark. First I’m trying
> with some 40k files. I’m using the binaryFiles API, with which I’m facing
> a couple of issues:
>
> 1. It creates only 4 tasks and I can’t seem to increase the parallelism
>    there.
> 2. It took 2276 seconds, which means that for millions of files it will
>    take ages to complete. I’m also expecting it to fail for millions of
>    records with some timeout or GC overhead exception.
>
> val files = sparkSession.sparkContext.binaryFiles(filePath, 200).cache
>
> val fileContentRdd = files.map(file => myFunc(file))
>
> Do you have any guidance on how I can process millions of files using
> the binaryFiles API?
>
> How can I increase the number of tasks/parallelism during the creation
> of the files RDD?
>
> Thanks
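The Hadoop Archive route mentioned above can be sketched as below. This is an untested outline against a hypothetical layout (the /data/pdfs and /data/archives paths are assumptions); it packs the small files so the NameNode tracks one archive instead of millions of entries:

```shell
# Pack every file under /data/pdfs into a single HAR file.
# -p gives the parent path; the archive is written to /data/archives.
hadoop archive -archiveName pdfs.har -p /data/pdfs /data/archives

# The originals remain readable through the har:// scheme:
hdfs dfs -ls har:///data/archives/pdfs.har

# ...and from Spark, e.g.:
#   sparkSession.sparkContext.binaryFiles("har:///data/archives/pdfs.har/*")
```

Note that a HAR reduces NameNode pressure but does not compact the data blocks themselves, so the per-file read cost remains; HBase or Ozone, as suggested above, address that differently.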