Re: GC overhead exceeded
What is your executor memory? Please share the code as well.

On Fri, Aug 18, 2017 at 10:06 AM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am getting the error below when running Spark SQL jobs. The error is thrown
> after about 80% of the tasks have completed. Any solution?
>
> spark.storage.memoryFraction=0.4
> spark.sql.shuffle.partitions=2000
> spark.default.parallelism=100
> #spark.eventLog.enabled=false
> #spark.scheduler.revive.interval=1s
> spark.driver.memory=8g
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at java.util.ArrayList.subList(ArrayList.java:955)
>     at java.lang.String.split(String.java:2311)
>     at sun.net.util.IPAddressUtil.textToNumericFormatV4(IPAddressUtil.java:47)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1129)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1098)
>     at java.net.InetAddress.getByName(InetAddress.java:1048)
>     at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:562)
>     at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:579)
>     at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
>     at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
>     at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
>     at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
>     at org.apache.spark.scheduler.TaskSetManager.dequeueTask(TaskSetManager.scala:380)
>     at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:433)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:276)
>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:271)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:357)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:355)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:355)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:352)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:352)
>     at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:222)
GC overhead exceeded
Hi,

I am getting the error below when running Spark SQL jobs. The error is thrown after about 80% of the tasks have completed. Any solution?

spark.storage.memoryFraction=0.4
spark.sql.shuffle.partitions=2000
spark.default.parallelism=100
#spark.eventLog.enabled=false
#spark.scheduler.revive.interval=1s
spark.driver.memory=8g

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.subList(ArrayList.java:955)
    at java.lang.String.split(String.java:2311)
    at sun.net.util.IPAddressUtil.textToNumericFormatV4(IPAddressUtil.java:47)
    at java.net.InetAddress.getAllByName(InetAddress.java:1129)
    at java.net.InetAddress.getAllByName(InetAddress.java:1098)
    at java.net.InetAddress.getByName(InetAddress.java:1048)
    at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:562)
    at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:579)
    at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
    at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
    at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
    at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
    at org.apache.spark.scheduler.TaskSetManager.dequeueTask(TaskSetManager.scala:380)
    at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:433)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:276)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:271)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:357)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:355)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:355)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:352)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:352)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:222)
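For readers hitting the same error: the stack trace above is entirely driver-side (TaskSchedulerImpl making resource offers), so the driver heap is the first suspect, and with spark.sql.shuffle.partitions=2000 the scheduler has a lot of task state to track. The invocation below is an illustrative sketch only, not a verified fix for this job; the flags are standard spark-submit options, but the memory values and the script name are placeholders to tune:

```shell
# Illustrative values only -- tune for your cluster and job.
# "GC overhead limit exceeded" in the driver-side scheduler usually means
# the driver heap is too small for the bookkeeping it has to do.
spark-submit \
  --master yarn \
  --driver-memory 12g \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=2000 \
  your_spark_sql_job.py
```

Note also that spark.storage.memoryFraction is the legacy (pre-unified) memory setting; on Spark 1.6+ the unified-memory knob is spark.memory.fraction.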
Spark Yarn mode - unsupported exception
Hello Users,

I am running into a Spark issue: "Unsupported major.minor version 52.0".

The code I am trying to run is https://github.com/cpitman/spark-drools-example/

This code runs fine in Spark local mode but fails horribly with the above exception when the job is submitted in YARN mode:

spark-submit --class com.awesome.App --master yarn ./SparkDroolsExample-1.0-SNAPSHOT.jar

This happens on the Cloudera VM and on a couple of real clusters as well. I have already tried to match and compile with the right JDKs, etc. I also tried switching the serializers for Spark. No luck.

*Env Details*
- Java JDK and JRE 1.8
- Spark 1.6.0
- Spark 2.1.0

Please let me know if I am totally missing something.

--
Sincerely,
Darshan
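Context for this error: "Unsupported major.minor version 52.0" means a class file compiled for Java 8 (class-file major version 52) is being loaded by an older JVM, so even if the build machine uses JDK 1.8, the YARN containers are likely running Java 7 or earlier. The snippet below is a small sketch of the class-file version numbers from the JVM specification; the dictionary and helper names are mine, not part of any API:

```python
# Class-file major version -> minimum JDK release (per the JVM specification).
CLASS_MAJOR_TO_JDK = {
    49: "Java 5",
    50: "Java 6",
    51: "Java 7",
    52: "Java 8",
}

def jdk_required(major_version):
    """Return the JDK release a given class-file major version requires."""
    return CLASS_MAJOR_TO_JDK.get(major_version, "unknown")

# The exception names version 52.0, i.e. the jar needs:
print(jdk_required(52))  # -> Java 8
```

To confirm which side is mismatched, running `java -version` inside a YARN container (or `javap -verbose` on a class from the jar, checking the "major version:" line) should show it; the fix is either running the containers on Java 8 or compiling with -source/-target 1.7.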
Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type
Hi Palwell,

Tried doing that, but it's becoming null for all the dates after the transformation with functions.

df2 = dflead.select('Enter_Date', f.to_date(df2.Enter_Date))

[image: Inline image 1]

Any insight?

Thanks,
Aakash.

On Fri, Aug 18, 2017 at 12:23 AM, Patrick Alwell wrote:

> Aakash,
>
> I've had similar issues with date-time formatting. Try using the functions
> library from pyspark.sql and the DF withColumn() method.
>
> from pyspark.sql import functions as f
>
> lineitem_df = lineitem_df.withColumn('shipdate', f.to_date(lineitem_df.shipdate))
>
> You should have first ingested the column as a string, and then leveraged
> the DF API to make the conversion to DateType.
>
> That should work.
>
> Kind Regards,
> -Pat Alwell
>
> On Aug 17, 2017, at 11:48 AM, Aakash Basu wrote:
>
>> Hey all,
>>
>> Thanks! I had a discussion with the person who authored that package and
>> informed him about this bug, but in the meantime, with the same package,
>> found a small tweak to get the job done.
>>
>> Now that part is fine: I'm getting the date as a string by predefining the
>> schema, but I want to later convert it to a datetime format, which is making
>> it this -
>>
>> >>> from pyspark.sql.functions import from_unixtime, unix_timestamp
>> >>> df2 = dflead.select('Enter_Date', from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))
>> >>> df2.show()
>>
>> Which is not correct, as it is converting the 15 to 0015 instead of 2015.
>> Do you guys think using the DateUtil package will solve this? Or any other
>> solution with this built-in package?
>>
>> Please help!
>>
>> Thanks,
>> Aakash.
>>
>> On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke wrote:
>>
>>> You can use Apache POI DateUtil to convert double to Date
>>> (https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
>>> Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/hadoopoffice/wiki);
>>> it supports Spark 1.x or Spark 2.0 ds.
>>>
>>> On 16. Aug 2017, at 20:15, Aakash Basu wrote:
>>>
>>>> Hey Irving,
>>>>
>>>> Thanks for the quick reply. In Excel that column is purely a string. I
>>>> actually want to import it as a String and later play around with the DF to
>>>> convert it back to a date type, but the API itself is not allowing me to
>>>> dynamically assign a schema to the DF; I'm forced to inferSchema, which
>>>> converts all numeric columns to double (though I don't know how the date
>>>> column gets converted to double if it is a string in the Excel source).
>>>>
>>>> Thanks,
>>>> Aakash.
>>>>
>>>> On 16-Aug-2017 11:39 PM, "Irving Duran" wrote:
>>>>
>>>>> I think there is a difference between the actual value in the cell and
>>>>> what Excel formats that cell to. You probably want to import that field as
>>>>> a string, or not have it as a date format in Excel.
>>>>>
>>>>> Just a thought.
>>>>>
>>>>> Thank You,
>>>>> Irving Duran
>>>>>
>>>>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> Forgot to attach the link to the discussion about overriding the schema
>>>>>> through the external package: https://github.com/crealytics/spark-excel/pull/13
>>>>>>
>>>>>> You can see my comment there too.
>>>>>>
>>>>>> Thanks,
>>>>>> Aakash.
>>>>>>
>>>>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>>>>>>> fetch data from an Excel file using
>>>>>>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>>>>>>> double for a date-type column.
>>>>>>>
>>>>>>> The detailed description is given here (the question I posted):
>>>>>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>>>>>>
>>>>>>> Found it is a probable bug with the crealytics Excel read package.
>>>>>>>
>>>>>>> Can somebody help me with a workaround for this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aakash.
Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type
Hey all,

Thanks! I had a discussion with the person who authored that package and informed him about this bug, but in the meantime, with the same package, found a small tweak to get the job done.

Now that part is fine: I'm getting the date as a string by predefining the schema, but I want to later convert it to a datetime format, which is making it this -

>>> from pyspark.sql.functions import from_unixtime, unix_timestamp
>>> df2 = dflead.select('Enter_Date', from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))
>>> df2.show()

[image: Inline image 1]

Which is not correct, as it is converting the 15 to 0015 instead of 2015. Do you guys think using the DateUtil package will solve this? Or any other solution with this built-in package?

Please help!

Thanks,
Aakash.

On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke wrote:

> You can use Apache POI DateUtil to convert double to Date
> (https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
> Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/hadoopoffice/wiki);
> it supports Spark 1.x or Spark 2.0 ds.
>
> On 16. Aug 2017, at 20:15, Aakash Basu wrote:
>
>> Hey Irving,
>>
>> Thanks for the quick reply. In Excel that column is purely a string. I
>> actually want to import it as a String and later play around with the DF to
>> convert it back to a date type, but the API itself is not allowing me to
>> dynamically assign a schema to the DF; I'm forced to inferSchema, which
>> converts all numeric columns to double (though I don't know how the date
>> column gets converted to double if it is a string in the Excel source).
>>
>> Thanks,
>> Aakash.
>>
>> On 16-Aug-2017 11:39 PM, "Irving Duran" wrote:
>>
>>> I think there is a difference between the actual value in the cell and
>>> what Excel formats that cell to. You probably want to import that field as
>>> a string, or not have it as a date format in Excel.
>>>
>>> Just a thought.
>>>
>>> Thank You,
>>> Irving Duran
>>>
>>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu wrote:
>>>
>>>> Hey all,
>>>>
>>>> Forgot to attach the link to the discussion about overriding the schema
>>>> through the external package: https://github.com/crealytics/spark-excel/pull/13
>>>>
>>>> You can see my comment there too.
>>>>
>>>> Thanks,
>>>> Aakash.
>>>>
>>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>>>>> fetch data from an Excel file using
>>>>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>>>>> double for a date-type column.
>>>>>
>>>>> The detailed description is given here (the question I posted):
>>>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>>>>
>>>>> Found it is a probable bug with the crealytics Excel read package.
>>>>>
>>>>> Can somebody help me with a workaround for this?
>>>>>
>>>>> Thanks,
>>>>> Aakash.
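A note on the 0015-vs-2015 symptom in this thread: it is consistent with how literal year patterns parse two-digit input. In Java's SimpleDateFormat (which Spark's unix_timestamp uses), 'yy' applies century inference while a longer literal pattern like 'yyy' takes the digits '15' at face value, yielding year 15. Python's strptime has the same split between its '%y' and '%Y' directives, so the distinction can be sketched outside Spark (the sample date string here is invented for illustration):

```python
from datetime import datetime

# Two-digit-year directive: '15' is interpreted relative to the current
# century, so the parsed year is 2015, which is what the thread wants.
d = datetime.strptime("08/16/15", "%m/%d/%y")
print(d.year)  # 2015
```

So in the Spark call, the pattern 'MM/dd/yyy' should likely be 'MM/dd/yy' if the source strings carry two-digit years (or 'MM/dd/yyyy' if they actually carry four); worth verifying against the real values in Enter_Date. The nulls from plain to_date earlier in the thread fit the same picture: in Spark 2.1 to_date takes no format argument, so it can only handle strings already castable to a date.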
Re: Reading CSV with multiLine option invalidates encoding option.
For when multiLine is not set, we currently only support ASCII-compatible encodings, to my knowledge, mainly due to the line separator, as I investigated in the comment.

For when multiLine is set, it appears encoding is not considered. I actually meant that encoding does not work at all in this case in the comment, but it looks like I should have been clearer on this.

I have been aware of it, but I personally think the encoding option is rather left incomplete due to non-ASCII-compatible encodings, and this actually brings complexity. For at least over a year, I have (personally) been wondering whether we should keep extending this feature, or whether we could rather deprecate this option.

The direction itself in your diff looks roughly correct, and I can't deny that it is a valid issue and fix for the current status.

A workaround, for now, would be to make a custom Hadoop input format, read the file as a text dataset, and parse it with DataFrameReader.csv(csvDataset: Dataset[String]).

2017-08-17 19:42 GMT+09:00 Han-Cheol Cho:

> Hi,
>
> Thank you for your response. I finally found the cause of this.
>
> When the multiLine option is set, the input file is read by the
> UnivocityParser.parseStream() method. This method, in turn, calls
> convertStream(), which initializes the tokenizer with
> tokenizer.beginParsing(inputStream) and parses records using
> tokenizer.parseNext().
>
> The problem is that the beginParsing() method uses UTF-8 as its default
> char-encoding. As a result, the user-provided "encoding" option is ignored.
>
> When the multiLine option is NOT set, on the other hand, the input file is
> first read and decoded in the TextInputCSVDataSource.readFile() method, and
> then sent to the UnivocityParser.parseIterator() method. Therefore, no
> problem occurs in this case.
>
> To solve this problem, I removed the call to tokenizer.beginParsing() in
> convertStream(), since we cannot access the options.charset variable there,
> and added it in two places: the tokenizeStream() and parseStream() methods.
> In parseStream() in particular, I added the charset as the second parameter
> of beginParsing().
>
> I attached the git diff content as an attachment file.
> I appreciate any comments on this.
>
> Best wishes,
> Han-Cheol
>
> On Wed, Aug 16, 2017 at 3:09 PM, Takeshi Yamamuro wrote:
>
>> Hi,
>>
>> Since the csv source currently supports ASCII-compatible charsets, I
>> guess shift-jis also works well. You could check Hyukjin's comment in
>> https://issues.apache.org/jira/browse/SPARK-21289 for more info.
>>
>> On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho wrote:
>>
>>> My apologies,
>>>
>>> It was a problem of our Hadoop cluster. When we tested the same code on
>>> another cluster (HDP-based), it worked without any problem.
>>>
>>> ```scala
>>> ## make sjis text
>>> cat a.txt
>>> 8月データだけでやってみよう
>>> nkf -W -s a.txt >b.txt
>>> cat b.txt
>>> 87n%G!<%?$@$1$G$d$C$F$_$h$&
>>> nkf -s -w b.txt
>>> 8月データだけでやってみよう
>>> hdfs dfs -put a.txt b.txt
>>>
>>> ## YARN mode test
>>> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>>
>>> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>>
>>> spark.read.option("encoding", "utf-8").option("multiLine", true).csv("a.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>>
>>> spark.read.option("encoding", "sjis").option("multiLine", true).csv("b.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>> ```
>>>
>>> I am still digging the root cause and will share it later :-)
>>>
>>> Best wishes,
>>> Han-Choel
>>>
>>> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho wrote:
>>>
>>>> Dear Spark ML members,
>>>>
>>>> I experienced trouble using the "multiLine" option to load CSV data
>>>> with Shift-JIS encoding. When option("multiLine", true) is specified,
>>>> option("encoding", "encoding-name") just doesn't work anymore.
>>>>
>>>> In the CSVDataSource.scala file, I found that the
>>>> MultiLineCSVDataSource.readFile() method doesn't use
>>>> parser.options.charset at all.
>>>>
>>>> object MultiLineCSVDataSource extends CSVDataSource {
>>>>   override val isSplitable: Boolean = false
>>>>
>>>>   override def readFile(
>>>>       conf: Configuration,
>>>>       file: PartitionedFile,
>>>>       parser: UnivocityParser,
>>>>       schema: StructType): Iterator[InternalRow] = {
>>>>     UnivocityParser.parseStream(
>>>>       CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
>>>>       parser.options.headerFlag,
>>>>       parser,
>>>>       schema)
>>>>   }
>>>>   ...
>>>>
>>>> On the other hand,
Spark 2 | Java | Dataset
Hey,

I was wondering if it would make sense to have a Dataset of something other than Row? Does anyone have an example (in Java) or a use case?

My use case would be to use Spark on existing objects we have and benefit from the distributed processing on those objects.

jg

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
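For reference, typed Datasets over user classes are supported in the Java API via Encoders.bean for JavaBean-style classes (Encoders.kryo is another option for arbitrary serializable objects). The sketch below is illustrative, not from the thread; the class, field, and app names are invented, and it assumes a local Spark 2.x setup:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class TypedDatasetExample {
    // JavaBean: no-arg constructor plus getters/setters is what
    // Encoders.bean() reflects over to build the schema.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("typed-dataset-sketch")
                .master("local[*]")
                .getOrCreate();

        Person p = new Person();
        p.setName("Ada");
        p.setAge(36);

        Encoder<Person> enc = Encoders.bean(Person.class);
        // A Dataset of your own type instead of Dataset<Row>.
        Dataset<Person> people = spark.createDataset(Arrays.asList(p), enc);

        // Typed operations hand your objects back, not Rows; the cast
        // disambiguates among Dataset.filter's overloads.
        Dataset<Person> adults =
                people.filter((FilterFunction<Person>) x -> x.getAge() >= 18);
        adults.show();

        spark.stop();
    }
}
```

The trade-off to be aware of: bean encoders keep columnar optimizations, while kryo-encoded Datasets store opaque binary blobs and lose most of them.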
Re: Reading CSV with multiLine option invalidates encoding option.
Hi,

Thank you for your response. I finally found the cause of this.

When the multiLine option is set, the input file is read by the UnivocityParser.parseStream() method. This method, in turn, calls convertStream(), which initializes the tokenizer with tokenizer.beginParsing(inputStream) and parses records using tokenizer.parseNext().

The problem is that the beginParsing() method uses UTF-8 as its default char-encoding. As a result, the user-provided "encoding" option is ignored.

When the multiLine option is NOT set, on the other hand, the input file is first read and decoded in the TextInputCSVDataSource.readFile() method, and then sent to the UnivocityParser.parseIterator() method. Therefore, no problem occurs in this case.

To solve this problem, I removed the call to tokenizer.beginParsing() in convertStream(), since we cannot access the options.charset variable there, and added it in two places: the tokenizeStream() and parseStream() methods. In parseStream() in particular, I added the charset as the second parameter of beginParsing().

I attached the git diff content as an attachment file. I appreciate any comments on this.

Best wishes,
Han-Cheol

On Wed, Aug 16, 2017 at 3:09 PM, Takeshi Yamamuro wrote:

> Hi,
>
> Since the csv source currently supports ASCII-compatible charsets, I
> guess shift-jis also works well. You could check Hyukjin's comment in
> https://issues.apache.org/jira/browse/SPARK-21289 for more info.
>
> On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho wrote:
>
>> My apologies,
>>
>> It was a problem of our Hadoop cluster. When we tested the same code on
>> another cluster (HDP-based), it worked without any problem.
>>
>> ```scala
>> ## make sjis text
>> cat a.txt
>> 8月データだけでやってみよう
>> nkf -W -s a.txt >b.txt
>> cat b.txt
>> 87n%G!<%?$@$1$G$d$C$F$_$h$&
>> nkf -s -w b.txt
>> 8月データだけでやってみよう
>> hdfs dfs -put a.txt b.txt
>>
>> ## YARN mode test
>> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
>> +--------------+
>> |           _c0|
>> +--------------+
>> |8月データだけでやってみよう|
>> +--------------+
>>
>> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
>> +--------------+
>> |           _c0|
>> +--------------+
>> |8月データだけでやってみよう|
>> +--------------+
>>
>> spark.read.option("encoding", "utf-8").option("multiLine", true).csv("a.txt").show(1)
>> +--------------+
>> |           _c0|
>> +--------------+
>> |8月データだけでやってみよう|
>> +--------------+
>>
>> spark.read.option("encoding", "sjis").option("multiLine", true).csv("b.txt").show(1)
>> +--------------+
>> |           _c0|
>> +--------------+
>> |8月データだけでやってみよう|
>> +--------------+
>> ```
>>
>> I am still digging the root cause and will share it later :-)
>>
>> Best wishes,
>> Han-Choel
>>
>> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho wrote:
>>
>>> Dear Spark ML members,
>>>
>>> I experienced trouble using the "multiLine" option to load CSV data with
>>> Shift-JIS encoding. When option("multiLine", true) is specified,
>>> option("encoding", "encoding-name") just doesn't work anymore.
>>>
>>> In the CSVDataSource.scala file, I found that the
>>> MultiLineCSVDataSource.readFile() method doesn't use
>>> parser.options.charset at all.
>>>
>>> object MultiLineCSVDataSource extends CSVDataSource {
>>>   override val isSplitable: Boolean = false
>>>
>>>   override def readFile(
>>>       conf: Configuration,
>>>       file: PartitionedFile,
>>>       parser: UnivocityParser,
>>>       schema: StructType): Iterator[InternalRow] = {
>>>     UnivocityParser.parseStream(
>>>       CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
>>>       parser.options.headerFlag,
>>>       parser,
>>>       schema)
>>>   }
>>>   ...
>>>
>>> On the other hand, TextInputCSVDataSource.readFile() method uses it:
>>>
>>> override def readFile(
>>>     conf: Configuration,
>>>     file: PartitionedFile,
>>>     parser: UnivocityParser,
>>>     schema: StructType): Iterator[InternalRow] = {
>>>   val lines = {
>>>     val linesReader = new HadoopFileLinesReader(file, conf)
>>>     Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => linesReader.close()))
>>>     linesReader.map { line =>
>>>       new String(line.getBytes, 0, line.getLength, parser.options.charset) // <= charset option is used here.
>>>     }
>>>   }
>>>
>>>   val shouldDropHeader = parser.options.headerFlag && file.start == 0
>>>   UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
>>> }
>>>
>>> It seems like a bug. Is there anyone who has had the same problem before?
>>>
>>> Best wishes,
>>> Han-Cheol
>>>
>>> --
>>> ==========================================
>>> Han-Cheol Cho, Ph.D.
>>> Data scientist, Data Science Team, Data Laboratory
>>> NHN Techorus Corp.
>>>
>>> Homepage: https://sites.google.com/site/priancho/
>>> ==========================================
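The practical effect of the bug discussed above (the multiLine path defaulting to UTF-8 while the user asked for Shift-JIS) can be reproduced outside Spark: Shift-JIS multibyte sequences are generally not valid UTF-8, so decoding with the wrong charset fails outright rather than silently. A minimal Python illustration, using the sample line from the thread:

```python
text = "8月データだけでやってみよう"       # the sample line from the thread
sjis_bytes = text.encode("shift_jis")   # what b.txt contains on disk

# Decoding with the charset the user asked for round-trips correctly.
assert sjis_bytes.decode("shift_jis") == text

# Decoding with a hard-coded UTF-8 default, which is effectively what the
# multiLine code path does, raises instead (or would produce mojibake
# under a lenient error handler).
try:
    sjis_bytes.decode("utf-8")
    raise AssertionError("expected a UnicodeDecodeError")
except UnicodeDecodeError:
    pass
```

This is also why the non-multiLine path works: TextInputCSVDataSource decodes each line with parser.options.charset before the univocity parser ever sees the bytes.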
Working with hadoop har file in spark
Hi,

I put a million files into a har archive on HDFS. I'd like to iterate over their file paths and read them. (Basically they are PDFs, and I want to transform them into text with Apache PDFBox.)

My first attempt was to list them with the hadoop command `hdfs dfs -ls har:///user//har/pdf.har`, and this works fine. However, when I try to replicate this in Spark, I get an error:

```
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val test = hdfs.listFiles(new Path("har:///user//har/pdf.har"), false)

java.lang.IllegalArgumentException: Wrong FS: har:/user//har/pdf.har, expected: hdfs://:
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:661)
```

However, I have been able to use `sc.textFile` without problems:

```
val test = sc.textFile("har:///user//har/pdf.har").count
8000
```

1) Is it easily solvable?
2) Do I need to implement my own pdfFile reader, inspired by textFile?
3) If not, is har the best way? I have been looking at AVRO too.

Thanks for any advice,

--
Nicolas
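A likely explanation for the "Wrong FS" error above: FileSystem.get(conf) returns the cluster's *default* filesystem (hdfs://...), which then rejects a har:// path. Asking the Path itself for its filesystem should resolve the har scheme instead. The sketch below is a hedged suggestion, not a tested fix; it reuses the (truncated) path from the question as-is and assumes the standard Hadoop FileSystem API:

```scala
// Sketch: resolve the filesystem from the Path so that the har:// scheme
// selects Hadoop's HarFileSystem instead of the default hdfs:// one.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil

val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val harPath = new Path("har:///user//har/pdf.har")   // path as in the question
val fs: FileSystem = harPath.getFileSystem(hconf)    // not FileSystem.get(hconf)

val files = fs.listFiles(harPath, true)              // recurse into the archive
while (files.hasNext) {
  val status = files.next()
  println(status.getPath)
  // fs.open(status.getPath) would then give a stream to feed to PDFBox.
}
```

FileSystem.get(new URI("har:///..."), hconf) should be an equivalent alternative. For the PDF-reading question, sc.binaryFiles on the har path may also be worth trying before writing a custom input format, since PDFs need whole-file bytes, not line splitting.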