Re: How to access line fileName in loading file using the textFile method

2018-09-26 Thread vermanurag
Spark has sc.wholeTextFiles(), which returns an RDD of tuples. The first
element of each tuple is the file name (its full path) and the second element
is the file content.
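
A minimal PySpark sketch of this approach, assuming an existing SparkContext
named sc. The helper names word_file_pairs and pairs_from_directory are made up
for illustration, and the regex tokenizer is only a placeholder for whatever
tokenization you actually need:

```python
import os
import re


def word_file_pairs(path_and_content):
    """Turn one (path, content) record into a list of (word, file_name) pairs."""
    path, content = path_and_content
    # wholeTextFiles gives the full path as the key; keep only the last component.
    file_name = os.path.basename(path)
    # Placeholder tokenizer (an assumption): runs of word characters.
    return [(word, file_name) for word in re.findall(r"\w+", content)]


def pairs_from_directory(sc, directory):
    """Build an RDD of (word, file_name) pairs; sc is an existing SparkContext."""
    # Each file becomes a single (path, content) record.
    return sc.wholeTextFiles(directory).flatMap(word_file_pairs)
```

Calling pairs_from_directory(sc, "path/") would give one RDD for the whole
directory instead of one sc.textFile call per file. Note that wholeTextFiles
reads each file as a single record, so it is best suited to many small files.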






Re: How to access line fileName in loading file using the textFile method

2018-09-24 Thread Maxim Gekk
> So my question is: supposing all files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record came
> from?

Maybe the input_file_name() function will help you:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@input_file_name():org.apache.spark.sql.Column
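
A rough PySpark sketch of how this could look. As far as I know,
input_file_name() is a column function, so it is used with the DataFrame
reader (spark.read.text) rather than with sc.textFile. The function name
words_with_file_name, the column names "word" and "fileName", and the
whitespace tokenization are assumptions for illustration; spark is an existing
SparkSession:

```python
def words_with_file_name(spark, path):
    """Return a DataFrame with one (word, fileName) row per token under path.

    spark is an existing SparkSession; path may be a directory or a glob.
    """
    # Imported inside the function so pyspark is only required when it is called.
    from pyspark.sql.functions import explode, input_file_name, split

    # spark.read.text yields one row per line, in a string column named "value".
    lines = spark.read.text(path)
    return (
        lines
        # Attach the source file (full path) to every line.
        .select("value", input_file_name().alias("fileName"))
        # Split each line on whitespace and emit one row per token.
        .select(explode(split("value", r"\s+")).alias("word"), "fileName")
        # Blank lines produce empty tokens; drop them.
        .where("word != ''")
    )
```

For example, words_with_file_name(spark, "path/*") reads everything in one job
and keeps the file name next to each word.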



-- 
Maxim Gekk
Technical Solutions Lead
Databricks Inc.
maxim.g...@databricks.com
databricks.com


Re: How to access line fileName in loading file using the textFile method

2018-09-24 Thread Jörn Franke
You can create your own data source that does exactly this.

Why is the file name important if the file content is the same?





How to access line fileName in loading file using the textFile method

2018-09-24 Thread Soheil Pourbafrani
Hi, my text data is stored as text files. In the processing logic, I need to
know which file each word comes from. Specifically, I need to tokenize the
words and create (word, fileName) pairs. The naive solution is to call
sc.textFile for each file, keep the fileName in a variable, and create the
pairs, but that is not efficient and I got a StackOverflow error as the
dataset grew.

So my question is: supposing all files are in a directory and I read them
using sc.textFile("path/*"), how can I tell which file each record came
from?

Is it possible (and necessary) to customize the textFile method?