Re: GC overhead exceeded

2017-08-17 Thread Pralabh Kumar
What is your executor memory? Please share the code as well.
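
For reference, the quoted configuration below does not set spark.executor.memory at all; only the driver memory is specified. A minimal sketch of setting it programmatically before the context is created (the values are placeholders, not recommendations):

```scala
// Sketch only -- suitable values depend on the cluster and the data volume.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-job")
  .config("spark.executor.memory", "8g")   // per-executor JVM heap
  .config("spark.executor.cores", "4")     // concurrent tasks per executor
  .getOrCreate()
```

The same keys can also be passed with --conf on spark-submit; spark.driver.memory in particular normally has to be set at submit time (e.g. --driver-memory), since the driver JVM is already running by the time application code executes.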

On Fri, Aug 18, 2017 at 10:06 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

>
> Hi,
>
> I am getting the below error when running Spark SQL jobs. The error is
> thrown after about 80% of the tasks have completed. Any solution?
>
> spark.storage.memoryFraction=0.4
> spark.sql.shuffle.partitions=2000
> spark.default.parallelism=100
> #spark.eventLog.enabled=false
> #spark.scheduler.revive.interval=1s
> spark.driver.memory=8g
>
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at java.util.ArrayList.subList(ArrayList.java:955)
>     at java.lang.String.split(String.java:2311)
>     at sun.net.util.IPAddressUtil.textToNumericFormatV4(IPAddressUtil.java:47)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1129)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1098)
>     at java.net.InetAddress.getByName(InetAddress.java:1048)
>     at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:562)
>     at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:579)
>     at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
>     at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
>     at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
>     at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
>     at org.apache.spark.scheduler.TaskSetManager.dequeueTask(TaskSetManager.scala:380)
>     at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:433)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:276)
>     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:271)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:357)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:355)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:355)
>     at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:352)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:352)
>     at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:222)
>
>


GC overhead exceeded

2017-08-17 Thread KhajaAsmath Mohammed
Hi,

I am getting the below error when running Spark SQL jobs. The error is
thrown after about 80% of the tasks have completed. Any solution?

spark.storage.memoryFraction=0.4
spark.sql.shuffle.partitions=2000
spark.default.parallelism=100
#spark.eventLog.enabled=false
#spark.scheduler.revive.interval=1s
spark.driver.memory=8g


java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.subList(ArrayList.java:955)
    at java.lang.String.split(String.java:2311)
    at sun.net.util.IPAddressUtil.textToNumericFormatV4(IPAddressUtil.java:47)
    at java.net.InetAddress.getAllByName(InetAddress.java:1129)
    at java.net.InetAddress.getAllByName(InetAddress.java:1098)
    at java.net.InetAddress.getByName(InetAddress.java:1048)
    at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:562)
    at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:579)
    at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
    at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
    at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
    at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
    at org.apache.spark.scheduler.TaskSetManager.dequeueTask(TaskSetManager.scala:380)
    at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:433)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:276)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:271)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:357)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:355)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:355)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:352)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:352)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:222)


Spark Yarn mode - unsupported exception

2017-08-17 Thread Darshan Pandya
Hello Users,

I am running into a Spark issue: "Unsupported major.minor version 52.0".

The code I am trying to run is
https://github.com/cpitman/spark-drools-example/


This code runs fine in Spark local mode but fails with the above exception
when the job is submitted in YARN mode.


spark-submit --class com.awesome.App --master yarn
./SparkDroolsExample-1.0-SNAPSHOT.jar


This happens on the Cloudera VM and on a couple of real clusters as well.

I have already tried to compile against the matching JDKs, etc.

I also tried switching Spark's serializers. No luck.


*Env Details*


Using Java JDK and JRE 1.8

Using Spark 1.6.0

Using Spark 2.1.0

Please let me know if I am totally missing something.
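
Class file version 52.0 corresponds to Java 8, so this error usually means that the JVM actually running the classes on YARN is older (e.g. Java 7) than the JDK the jar was built with, even if spark-submit itself runs under 1.8. One commonly suggested workaround, sketched below with a placeholder JDK path, is to point the application master and the executors at a Java 8 install:

spark-submit --class com.awesome.App --master yarn \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/jdk1.8.0 \
  --conf spark.executorEnv.JAVA_HOME=/usr/java/jdk1.8.0 \
  ./SparkDroolsExample-1.0-SNAPSHOT.jar

Checking which java binary (and JAVA_HOME) the NodeManagers actually use is the quickest way to confirm whether this is the cause.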




-- 
Sincerely,
Darshan


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
Hi Palwell,

Tried doing that, but it is becoming null for all the dates after the
transformation with the functions library.

df2 = dflead.select('Enter_Date',f.to_date(df2.Enter_Date))


[image: Inline image 1]

Any insight?

Thanks,
Aakash.
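
On the null results above: in Spark 2.1, to_date() takes no format argument, so it expects ISO yyyy-MM-dd strings and returns null for values in any other format. Parsing with an explicit pattern avoids both the nulls here and the 0015-instead-of-2015 issue from the earlier mail. A rough sketch follows (in Scala; the equivalent functions exist in pyspark.sql.functions, and the DataFrame and column names simply mirror the thread):

```scala
// Sketch only: "yy" resolves a two-digit year such as 15 to 2015, whereas the
// earlier pattern "yyy" parses "15" literally as the year 0015.
import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

val df2 = dflead.select(
  col("Enter_Date"),
  from_unixtime(unix_timestamp(col("Enter_Date"), "MM/dd/yy"))
    .cast("date")
    .alias("date"))
```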

On Fri, Aug 18, 2017 at 12:23 AM, Patrick Alwell 
wrote:

> Aakash,
>
> I’ve had similar issues with date-time formatting. Try using the functions
> library from pyspark.sql and the DataFrame withColumn() method.
>
> ——
>
> from pyspark.sql import functions as f
>
> lineitem_df = lineitem_df.withColumn('shipdate', f.to_date(lineitem_df.shipdate))
>
> ——
>
> You should first ingest the column as a string and then use the DataFrame
> API to make the conversion to DateType.
>
> That should work.
>
> Kind Regards
>
> -Pat Alwell
>
>
> On Aug 17, 2017, at 11:48 AM, Aakash Basu 
> wrote:
>
> Hey all,
>
> Thanks! I had a discussion with the person who authored that package and
> informed him about this bug; in the meantime, I found a small tweak to get
> the job done.
>
> Now I'm getting the date as a string by predefining the schema, but when I
> later convert it to a datetime format, it comes out like this -
>
> >>> from pyspark.sql.functions import from_unixtime, unix_timestamp
> >>> df2 = dflead.select('Enter_Date', from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))
>
>
> >>> df2.show()
>
> 
>
> This is not correct, as it is converting 15 to 0015 instead of 2015. Do you
> think the DateUtil package will solve this? Or is there any other solution
> with this built-in package?
>
> Please help!
>
> Thanks,
> Aakash.
>
> On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke 
> wrote:
>
>> You can use Apache POI DateUtil to convert double to Date (
>> https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
>> Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/h
>> adoopoffice/wiki), it supports Spark 1.x or Spark 2.0 ds.
>>
>> On 16. Aug 2017, at 20:15, Aakash Basu 
>> wrote:
>>
>> Hey Irving,
>>
>> Thanks for a quick revert. In Excel that column is purely string, I
>> actually want to import that as a String and later play around the DF to
>> convert it back to date type, but the API itself is not allowing me to
>> dynamically assign a Schema to the DF and I'm forced to inferSchema, where
>> itself, it is converting all numeric columns to double (Though, I don't
>> know how then the date column is getting converted to double if it is
>> string in the Excel source).
>>
>> Thanks,
>> Aakash.
>>
>>
>> On 16-Aug-2017 11:39 PM, "Irving Duran"  wrote:
>>
>> I think there is a difference between the actual value in the cell and
>> what Excel formats that cell.  You probably want to import that field as a
>> string or not have it as a date format in Excel.
>>
>> Just a thought
>>
>>
>> Thank You,
>>
>> Irving Duran
>>
>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu > > wrote:
>>
>>> Hey all,
>>>
>>> Forgot to attach the link to the overriding Schema through external
>>> package's discussion.
>>>
>>> https://github.com/crealytics/spark-excel/pull/13
>>>
>>> You can see my comment there too.
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <
>>> aakash.spark@gmail.com> wrote:
>>>
 Hi all,

 I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
 fetch data from an excel file using
 *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
 double for a date type column.

 The detailed description is given here (the question I posted) -

 https://stackoverflow.com/questions/45713699/inferschema-usi
 ng-spark-read-formatcom-crealytics-spark-excel-is-inferring-d


 Found it is a probable bug with the crealytics excel read package.

 Can somebody help me with a workaround for this?

 Thanks,
 Aakash.

>>>
>>>
>>
>>
>
>


Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
Hey all,

Thanks! I had a discussion with the person who authored that package and
informed him about this bug; in the meantime, I found a small tweak to get
the job done.

Now I'm getting the date as a string by predefining the schema, but when I
later convert it to a datetime format, it comes out like this -

>>> from pyspark.sql.functions import from_unixtime, unix_timestamp
>>> df2 = dflead.select('Enter_Date', from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))


>>> df2.show()

[image: Inline image 1]

This is not correct, as it is converting 15 to 0015 instead of 2015. Do you
think the DateUtil package will solve this? Or is there any other solution
with this built-in package?

Please help!

Thanks,
Aakash.

On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke  wrote:

> You can use Apache POI DateUtil to convert double to Date (
> https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
> Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/
> hadoopoffice/wiki), it supports Spark 1.x or Spark 2.0 ds.
>
> On 16. Aug 2017, at 20:15, Aakash Basu  wrote:
>
> Hey Irving,
>
> Thanks for a quick revert. In Excel that column is purely string, I
> actually want to import that as a String and later play around the DF to
> convert it back to date type, but the API itself is not allowing me to
> dynamically assign a Schema to the DF and I'm forced to inferSchema, where
> itself, it is converting all numeric columns to double (Though, I don't
> know how then the date column is getting converted to double if it is
> string in the Excel source).
>
> Thanks,
> Aakash.
>
>
> On 16-Aug-2017 11:39 PM, "Irving Duran"  wrote:
>
> I think there is a difference between the actual value in the cell and
> what Excel formats that cell.  You probably want to import that field as a
> string or not have it as a date format in Excel.
>
> Just a thought
>
>
> Thank You,
>
> Irving Duran
>
> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu 
> wrote:
>
>> Hey all,
>>
>> Forgot to attach the link to the overriding Schema through external
>> package's discussion.
>>
>> https://github.com/crealytics/spark-excel/pull/13
>>
>> You can see my comment there too.
>>
>> Thanks,
>> Aakash.
>>
>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu > > wrote:
>>
>>> Hi all,
>>>
>>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>>> fetch data from an excel file using
>>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>>> double for a date type column.
>>>
>>> The detailed description is given here (the question I posted) -
>>>
>>> https://stackoverflow.com/questions/45713699/inferschema-usi
>>> ng-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>>
>>>
>>> Found it is a probable bug with the crealytics excel read package.
>>>
>>> Can somebody help me with a workaround for this?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>
>>
>
>


Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Hyukjin Kwon
When multiLine is not set, we currently only support ASCII-compatible
encodings, to my knowledge, mainly because of the line separator, as I
investigated in the comment.
When multiLine is set, it appears the encoding option is not considered at
all. In the comment I actually meant that encoding does not work at all in
this case, but it looks like I should have been clearer about that.

I have been aware of this, but I personally think the encoding option has
been left incomplete because of non-ASCII-compatible encodings, and this
adds complexity. For at least a year I have been (personally) wondering
whether we should keep extending this feature or whether we should rather
deprecate the option.

The direction in your diff looks roughly correct, and I can't deny that it
is a valid issue and a valid fix for the current state.

For now, a workaround would be to make a custom Hadoop input format, read
the file as a text dataset, and parse it with
DataFrameReader.csv(csvDataset: Dataset[String]).
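
A rough sketch of that text-dataset route, for the simpler case where records do not actually span lines (paths and the charset are placeholders, and a Spark version with DataFrameReader.csv(Dataset[String]) is assumed; if records really contain embedded newlines, the custom input format is still needed):

```scala
import spark.implicits._

// Decode each file with the desired charset ourselves, then hand the lines
// to the CSV parser as a Dataset[String].
val decoded = spark.sparkContext
  .binaryFiles("hdfs:///path/to/sjis/*.csv")
  .flatMap { case (_, stream) =>
    new String(stream.toArray, "Shift_JIS").split("\n")
  }
  .toDS()

val df = spark.read.option("header", "true").csv(decoded)
```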



2017-08-17 19:42 GMT+09:00 Han-Cheol Cho :

> Hi,
>
> Thank you for your response.
> I finally found the cause of this problem.
>
>
> When the multiLine option is set, the input file is read by the
> UnivocityParser.parseStream() method.
> This method, in turn, calls convertStream(), which initializes the
> tokenizer with tokenizer.beginParsing(inputStream) and parses records
> using tokenizer.parseNext().
>
> The problem is that the beginParsing() method uses UTF-8 as its default
> character encoding.
> As a result, the user-provided "encoding" option is ignored.
>
>
> When the multiLine option is NOT set, on the other hand, the input file is
> first read and decoded in the TextInputCSVDataSource.readFile() method.
> Then it is sent to the UnivocityParser.parseIterator() method.
> Therefore, no problem occurs in this case.
>
>
> To solve this problem, I removed the call to tokenizer.beginParsing() in
> convertStream(), since we cannot access the options.charset variable there.
> Then I added it in two places: the tokenizeStream() and parseStream()
> methods. In particular, in the parseStream() method, I passed the charset
> as the second parameter to beginParsing().
>
> I attached the git diff as an attachment file.
> I would appreciate any comments on this.
>
>
> Best wishes,
> Han-Cheol
>
>
>
>
> On Wed, Aug 16, 2017 at 3:09 PM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> Since the csv source currently supports ascii-compatible charset, so I
>> guess shift-jis also works well.
>> You could check Hyukjin's comment in https://issues.apache.org/j
>> ira/browse/SPARK-21289 for more info.
>>
>>
>> On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho 
>> wrote:
>>
>>> My apologies,
>>>
>>> It was a problem of our Hadoop cluster.
>>> When we tested the same code on another cluster (HDP-based), it worked
>>> without any problem.
>>>
>>> ```scala
>>> ## make sjis text
>>> cat a.txt
>>> 8月データだけでやってみよう
>>> nkf -W -s a.txt >b.txt
>>> cat b.txt
>>> 87n%G!<%?$@$1$G$d$C$F$_$h$&
>>> nkf -s -w b.txt
>>> 8月データだけでやってみよう
>>> hdfs dfs -put a.txt b.txt
>>>
>>> ## YARN mode test
>>> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>>
>>> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>>
>>> spark.read.option("encoding", "utf-8").option("multiLine",
>>> true).csv("a.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>>
>>> spark.read.option("encoding", "sjis").option("multiLine",
>>> true).csv("b.txt").show(1)
>>> +--+
>>> |   _c0|
>>> +--+
>>> |8月データだけでやってみよう|
>>> +--+
>>> ```
>>>
>>> I am still digging the root cause and will share it later :-)
>>>
>>> Best wishes,
>>> Han-Choel
>>>
>>>
>>> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho 
>>> wrote:
>>>
 Dear Spark ML members,


 I ran into trouble using the "multiLine" option to load CSV data with
 Shift-JIS encoding.
 When option("multiLine", true) is specified, option("encoding",
 "encoding-name") just doesn't work anymore.


 In the CSVDataSource.scala file, I found that the
 MultiLineCSVDataSource.readFile() method doesn't use parser.options.charset
 at all.

 object MultiLineCSVDataSource extends CSVDataSource {
   override val isSplitable: Boolean = false

   override def readFile(
   conf: Configuration,
   file: PartitionedFile,
   parser: UnivocityParser,
   schema: StructType): Iterator[InternalRow] = {
 UnivocityParser.parseStream(
   CodecStreams.createInputStreamWithCloseResource(conf,
 file.filePath),
   parser.options.headerFlag,
   parser,
   schema)
   }
   ...

 On the other hand, 

Spark 2 | Java | Dataset

2017-08-17 Thread Jean Georges Perrin
Hey,

I was wondering if it would make sense to have a Dataset of something other
than Row?

Does anyone have an example (in Java) or a use case?

My use case would be to use Spark on existing objects we have and benefit
from distributed processing on those objects.

jg
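
Typed Datasets are intended for exactly this. In Scala, any case class works with the built-in encoders; in Java, the usual route is a JavaBean class together with Encoders.bean(MyClass.class), passed to createDataset() or Dataset.as(). A minimal sketch, in Scala, with an invented domain class:

```scala
import org.apache.spark.sql.SparkSession

// Invented class, purely for illustration.
case class Book(title: String, pages: Int)

val spark = SparkSession.builder().appName("typed-ds").getOrCreate()
import spark.implicits._

// A Dataset[Book] rather than a Dataset[Row]: operations are typed and run
// directly against the domain objects.
val books = Seq(Book("A", 320), Book("B", 180)).toDS()
val thick = books.filter(_.pages > 200)
thick.show()
```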



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Han-Cheol Cho
Hi,

Thank you for your response.
I finally found the cause of this problem.


When the multiLine option is set, the input file is read by the
UnivocityParser.parseStream() method.
This method, in turn, calls convertStream(), which initializes the tokenizer
with tokenizer.beginParsing(inputStream) and parses records using
tokenizer.parseNext().

The problem is that the beginParsing() method uses UTF-8 as its default
character encoding.
As a result, the user-provided "encoding" option is ignored.


When the multiLine option is NOT set, on the other hand, the input file is
first read and decoded in the TextInputCSVDataSource.readFile() method.
Then it is sent to the UnivocityParser.parseIterator() method.
Therefore, no problem occurs in this case.


To solve this problem, I removed the call to tokenizer.beginParsing() in
convertStream(), since we cannot access the options.charset variable there.
Then I added it in two places: the tokenizeStream() and parseStream()
methods. In particular, in the parseStream() method, I passed the charset as
the second parameter to beginParsing().

I attached the git diff as an attachment file.
I would appreciate any comments on this.


Best wishes,
Han-Cheol




On Wed, Aug 16, 2017 at 3:09 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Since the csv source currently supports ascii-compatible charset, so I
> guess shift-jis also works well.
> You could check Hyukjin's comment in https://issues.apache.org/
> jira/browse/SPARK-21289 for more info.
>
>
> On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho  wrote:
>
>> My apologies,
>>
>> It was a problem of our Hadoop cluster.
>> When we tested the same code on another cluster (HDP-based), it worked
>> without any problem.
>>
>> ```scala
>> ## make sjis text
>> cat a.txt
>> 8月データだけでやってみよう
>> nkf -W -s a.txt >b.txt
>> cat b.txt
>> 87n%G!<%?$@$1$G$d$C$F$_$h$&
>> nkf -s -w b.txt
>> 8月データだけでやってみよう
>> hdfs dfs -put a.txt b.txt
>>
>> ## YARN mode test
>> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
>> +--+
>> |   _c0|
>> +--+
>> |8月データだけでやってみよう|
>> +--+
>>
>> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
>> +--+
>> |   _c0|
>> +--+
>> |8月データだけでやってみよう|
>> +--+
>>
>> spark.read.option("encoding", "utf-8").option("multiLine",
>> true).csv("a.txt").show(1)
>> +--+
>> |   _c0|
>> +--+
>> |8月データだけでやってみよう|
>> +--+
>>
>> spark.read.option("encoding", "sjis").option("multiLine",
>> true).csv("b.txt").show(1)
>> +--+
>> |   _c0|
>> +--+
>> |8月データだけでやってみよう|
>> +--+
>> ```
>>
>> I am still digging the root cause and will share it later :-)
>>
>> Best wishes,
>> Han-Choel
>>
>>
>> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho 
>> wrote:
>>
>>> Dear Spark ML members,
>>>
>>>
>>> I ran into trouble using the "multiLine" option to load CSV data with
>>> Shift-JIS encoding.
>>> When option("multiLine", true) is specified, option("encoding",
>>> "encoding-name") just doesn't work anymore.
>>>
>>>
>>> In the CSVDataSource.scala file, I found that the
>>> MultiLineCSVDataSource.readFile() method doesn't use
>>> parser.options.charset at all.
>>>
>>> object MultiLineCSVDataSource extends CSVDataSource {
>>>   override val isSplitable: Boolean = false
>>>
>>>   override def readFile(
>>>   conf: Configuration,
>>>   file: PartitionedFile,
>>>   parser: UnivocityParser,
>>>   schema: StructType): Iterator[InternalRow] = {
>>> UnivocityParser.parseStream(
>>>   CodecStreams.createInputStreamWithCloseResource(conf,
>>> file.filePath),
>>>   parser.options.headerFlag,
>>>   parser,
>>>   schema)
>>>   }
>>>   ...
>>>
>>> On the other hand, TextInputCSVDataSource.readFile() method uses it:
>>>
>>>   override def readFile(
>>>   conf: Configuration,
>>>   file: PartitionedFile,
>>>   parser: UnivocityParser,
>>>   schema: StructType): Iterator[InternalRow] = {
>>> val lines = {
>>>   val linesReader = new HadoopFileLinesReader(file, conf)
>>>   Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_
>>> => linesReader.close()))
>>>   linesReader.map { line =>
>>> new String(line.getBytes, 0, line.getLength,
>>> parser.options.charset)// < charset option is used here.
>>>   }
>>> }
>>>
>>> val shouldDropHeader = parser.options.headerFlag && file.start == 0
>>> UnivocityParser.parseIterator(lines, shouldDropHeader, parser,
>>> schema)
>>>   }
>>>
>>>
>>> It seems like a bug.
>>> Is there anyone who had the same problem before?
>>>
>>>
>>> Best wishes,
>>> Han-Cheol
>>>
>>> --
>>> ==
>>> Han-Cheol Cho, Ph.D.
>>> Data scientist, Data Science Team, Data Laboratory
>>> NHN Techorus Corp.
>>>
>>> Homepage: https://sites.google.com/site/priancho/
>>> ==
>>>
>>
>>
>>
>> --
>> 

Working with hadoop har file in spark

2017-08-17 Thread Nicolas Paris
Hi,

I put a million files into a HAR archive on HDFS. I'd like to iterate over
their file paths and read them (basically they are PDFs, and I want to
transform them into text with Apache PDFBox).

My first attempt was to list them with the Hadoop command
`hdfs dfs -ls har:///user//har/pdf.har`, and this works fine.
However, when I try to replicate this in Spark, I get an error:

```  
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val test = hdfs.listFiles(new Path("har:///user//har/pdf.har"), false)
java.lang.IllegalArgumentException: Wrong FS: har:/user//har/pdf.har, expected: hdfs://:
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:661)
```  

However, I was able to use `sc.textFile` without a problem:

```
val test = sc.textFile("har:///user//har/pdf.har").count
8000
```  

--
1) Is this easily solvable?
2) Do I need to implement my own PDF file reader, inspired by textFile?
3) If not, is HAR the best approach? I have been looking at Avro too.

Thanks for any advice,

-- 
Nicolas
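
On question 1: the "Wrong FS" error above comes from FileSystem.get(hconf), which returns the default (hdfs://) filesystem; asking the Path for its own filesystem yields the har filesystem instead. And since sc.textFile already resolves har:// paths, sc.binaryFiles should as well, which gives a route to PDFBox without a custom reader. A rough sketch (PDFBox 2.x assumed, paths as in the mail above):

```scala
import org.apache.hadoop.fs.Path
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Listing: let the Path pick its own FileSystem (the har one) instead of the
// default filesystem returned by FileSystem.get(hconf).
val harPath = new Path("har:///user//har/pdf.har")
val harFs = harPath.getFileSystem(hconf)
val files = harFs.listFiles(harPath, true)

// Reading: binaryFiles goes through the same Hadoop FileSystem layer as
// textFile, so a har:// path should work here too.
val texts = sc.binaryFiles("har:///user//har/pdf.har")
  .mapValues { stream =>
    val doc = PDDocument.load(stream.toArray)
    try new PDFTextStripper().getText(doc) finally doc.close()
  }
```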

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org