[jira] [Created] (SPARK-26971) How to read delimiter (Cedilla) in spark RDD and Dataframes

Babu (JIRA) Fri, 22 Feb 2019 06:18:17 -0800

Babu created SPARK-26971:
----------------------------

             Summary: How to read delimiter (Cedilla) in spark RDD and Dataframes
                 Key: SPARK-26971
                 URL: https://issues.apache.org/jira/browse/SPARK-26971
             Project: Spark
          Issue Type: Question
          Components: PySpark
    Affects Versions: 1.6.0
            Reporter: Babu



 

I am trying to read a cedilla-delimited text file from HDFS and am getting the 
error below. Has anyone faced a similar issue?

{{hadoop fs -cat test_file.dat }}

{{1ÇCelvelandÇOhio}}
{{2ÇDurhamÇNC}}
{{3ÇDallasÇTexas}}

{{>>> rdd = sc.textFile("test_file.dat") }}

{{>>> rdd.collect()}}
{{[u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', u'3Dallas\xc7Texas']}}

{{>>> rdd.map(lambda p: p.split("\xc7")).collect()}}
{{UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128)}}

{{>>> sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show()}}
{{1ÇCelvelandÇOhio}}
{{2ÇDurhamÇNC}}
{{3DallasÇTexas}}
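
A note on the UnicodeDecodeError in the transcript above: the lines returned by rdd.collect() are unicode strings (note the u'' prefix), while "\xc7" in the lambda is a Python 2 byte string. Splitting a unicode string on a byte string makes Python 2 decode the separator with the ASCII codec, which fails on byte 0xc7. The following is a minimal sketch of one possible workaround, not a confirmed fix for this issue: it writes the delimiter as a unicode literal. The names DELIMITER and split_cedilla are hypothetical, and the sample rows are copied from the collect() output above so that the snippet runs without Spark.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical constant: the delimiter as a *unicode* literal, so Python 2
# never has to decode a byte string with the ASCII codec.
DELIMITER = u"\xc7"  # U+00C7, LATIN CAPITAL LETTER C WITH CEDILLA


def split_cedilla(line):
    """Split one unicode line on the cedilla delimiter (hypothetical helper)."""
    return line.split(DELIMITER)


# Sample rows copied from the rdd.collect() output in the report.
sample_lines = [u"1\xc7Celveland\xc7Ohio", u"2\xc7Durham\xc7NC"]

rows = [split_cedilla(line) for line in sample_lines]
print(rows)
```

In the PySpark shell the same helper could be passed to the RDD as rdd.map(split_cedilla). For the DataFrame attempt, the "text" data source reads each line into a single string column and, as far as I know, does not act on a "delimiter" option, which would be consistent with the unsplit rows shown above.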

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
