Re: Help With unstructured text file with spark scala

2022-02-25 Thread Danilo Sousa
Rafael Mendes,

Are you from ?

Thanks.
> On 21 Feb 2022, at 15:33, Danilo Sousa  wrote:
> 
> Yes, this is only a single file.
> 
> Thanks Rafael Mendes.
> 
>> On 13 Feb 2022, at 07:13, Rafael Mendes wrote:
>> 
>> Hi, Danilo.
>> Do you have a single large file, only?
>> If so, I guess you can use tools like sed/awk to split it into more files 
>> based on layout, so you can read these files into Spark.
>> 
>> 
>> On Wed, Feb 9, 2022, 09:30, Bitfox wrote:
>> Hi
>> 
>> I am not sure about the whole situation, but if you want a Scala solution, I think 
>> you could use a regex to match and capture the keywords.
>> Here is one I wrote that you can adapt on your end.
>> 
>> import scala.io.Source
>> import scala.collection.mutable.ArrayBuffer
>> 
>> val list1 = ArrayBuffer[(String,String,String)]()
>> val list2 = ArrayBuffer[(String,String)]()
>> 
>> 
>> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
>> val patt2 = """^(.*)#([^#]*)$""".r
>> 
>> val file = "1.txt"
>> val lines = Source.fromFile(file).getLines()
>> 
>> for ( x <- lines ) {
>>   x match {
>> case patt1(k,v,z) => list1 += ((k,v,z))
>> case patt2(k,v) => list2 += ((k,v))
>> case _ => println("no match")
>>   }
>> }
>> 
>> 
>> Now list1 and list2 hold the elements you wanted; you can convert them 
>> to a dataframe easily.
>> 
>> Thanks.
>> 
>> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa wrote:
>> Hello
>> 
>> 
>> Yes, this block I can open as CSV with the # delimiter, but there is a block 
>> that is not in CSV format. 
>> 
>> That one is more like key/value. 
>> 
>> We have two different layouts in the same file. This is the “problem”.
>> 
>> Thanks for your time.
>> 
>> 
>> 
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>> 
>>> Contrato#123456 - Test
>>> Empresa#Test
>> 
>>> On 9 Feb 2022, at 00:58, Bitfox wrote:
>>> 
>>> Hello
>>> 
>>> You can treat it as a CSV file and load it from Spark:
>>> 
>>> >>> df = spark.read.format("csv").option("inferSchema", 
>>> >>> "true").option("header", "true").option("sep","#").load(csv_file)
>>> >>> df.show()
>>> +--------------------+-------------------+-----------------+
>>> |               Plano|Código Beneficiário|Nome Beneficiário|
>>> +--------------------+-------------------+-----------------+
>>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>>> +--------------------+-------------------+-----------------+
>>> 
>>> 
>>> cat csv_file:
>>> 
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>> 
>>> 
>>> Regards
>>> 
>>> 
>>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa wrote:
>>> Hi
>>> I have to transform unstructured text to dataframe.
>>> Could anyone please help with Scala code ?
>>> 
>>> I need the dataframe as:
>>> 
>>> operadora filial unidade contrato empresa plano codigo_beneficiario 
>>> nome_beneficiario
>>> 
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>> 
>>> Contrato#123456 - Test
>>> Empresa#Test
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>> 
>>> Contrato#898011000 - FUNDACAO GERDAU
>>> Empresa#FUNDACAO GERDAU
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>> 
>>> 
>> 
> 



Re: Help With unstructured text file with spark scala

2022-02-21 Thread Danilo Sousa
Yes, this is only a single file.

Thanks Rafael Mendes.

> On 13 Feb 2022, at 07:13, Rafael Mendes  wrote:
> 
> Hi, Danilo.
> Do you have a single large file, only?
> If so, I guess you can use tools like sed/awk to split it into more files 
> based on layout, so you can read these files into Spark.
> 
> 
> On Wed, Feb 9, 2022, 09:30, Bitfox wrote:
> Hi
> 
> I am not sure about the whole situation, but if you want a Scala solution, I think 
> you could use a regex to match and capture the keywords.
> Here is one I wrote that you can adapt on your end.
> 
> import scala.io.Source
> import scala.collection.mutable.ArrayBuffer
> 
> val list1 = ArrayBuffer[(String,String,String)]()
> val list2 = ArrayBuffer[(String,String)]()
> 
> 
> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
> val patt2 = """^(.*)#([^#]*)$""".r
> 
> val file = "1.txt"
> val lines = Source.fromFile(file).getLines()
> 
> for ( x <- lines ) {
>   x match {
> case patt1(k,v,z) => list1 += ((k,v,z))
> case patt2(k,v) => list2 += ((k,v))
> case _ => println("no match")
>   }
> }
> 
> 
> Now list1 and list2 hold the elements you wanted; you can convert them to 
> a dataframe easily.
> 
> Thanks.
> 
> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa wrote:
> Hello
> 
> 
> Yes, this block I can open as CSV with the # delimiter, but there is a block 
> that is not in CSV format. 
> 
> That one is more like key/value. 
> 
> We have two different layouts in the same file. This is the “problem”.
> 
> Thanks for your time.
> 
> 
> 
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>> 
>> Contrato#123456 - Test
>> Empresa#Test
> 
>> On 9 Feb 2022, at 00:58, Bitfox wrote:
>> 
>> Hello
>> 
>> You can treat it as a CSV file and load it from Spark:
>> 
>> >>> df = spark.read.format("csv").option("inferSchema", 
>> >>> "true").option("header", "true").option("sep","#").load(csv_file)
>> >>> df.show()
>> +--------------------+-------------------+-----------------+
>> |               Plano|Código Beneficiário|Nome Beneficiário|
>> +--------------------+-------------------+-----------------+
>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>> +--------------------+-------------------+-----------------+
>> 
>> 
>> cat csv_file:
>> 
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>> 
>> 
>> Regards
>> 
>> 
>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa wrote:
>> Hi
>> I have to transform unstructured text to dataframe.
>> Could anyone please help with Scala code ?
>> 
>> I need the dataframe as:
>> 
>> operadora filial unidade contrato empresa plano codigo_beneficiario 
>> nome_beneficiario
>> 
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>> 
>> Contrato#123456 - Test
>> Empresa#Test
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>> 
>> Contrato#898011000 - FUNDACAO GERDAU
>> Empresa#FUNDACAO GERDAU
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>> 
>> 
> 



Re: Help With unstructured text file with spark scala

2022-02-13 Thread Rafael Mendes
Hi, Danilo.
Do you have a single large file, only?
If so, I guess you can use tools like sed/awk to split it into more files
based on layout, so you can read these files into Spark.


On Wed, Feb 9, 2022, 09:30, Bitfox wrote:

> Hi
>
> I am not sure about the whole situation, but if you want a Scala solution, I think
> you could use a regex to match and capture the keywords.
> Here is one I wrote that you can adapt on your end.
>
> import scala.io.Source
> import scala.collection.mutable.ArrayBuffer
>
> val list1 = ArrayBuffer[(String,String,String)]()
> val list2 = ArrayBuffer[(String,String)]()
>
> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
> val patt2 = """^(.*)#([^#]*)$""".r
>
> val file = "1.txt"
> val lines = Source.fromFile(file).getLines()
>
> for ( x <- lines ) {
>   x match {
>     case patt1(k,v,z) => list1 += ((k,v,z))
>     case patt2(k,v) => list2 += ((k,v))
>     case _ => println("no match")
>   }
> }
>
>
>
> Now list1 and list2 hold the elements you wanted; you can convert them
> to a dataframe easily.
>
>
> Thanks.
>
> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa 
> wrote:
>
>> Hello
>>
>>
>> Yes, this block I can open as CSV with the # delimiter, but there is a
>> block that is not in CSV format.
>>
>> That one is more like key/value.
>>
>> We have two different layouts in the same file. This is the “problem”.
>>
>> Thanks for your time.
>>
>>
>>
>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>
>>
>> On 9 Feb 2022, at 00:58, Bitfox  wrote:
>>
>> Hello
>>
>> You can treat it as a CSV file and load it from Spark:
>>
>> >>> df = spark.read.format("csv").option("inferSchema",
>> "true").option("header", "true").option("sep","#").load(csv_file)
>> >>> df.show()
>> +--------------------+-------------------+-----------------+
>> |               Plano|Código Beneficiário|Nome Beneficiário|
>> +--------------------+-------------------+-----------------+
>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>> +--------------------+-------------------+-----------------+
>>
>>
>> cat csv_file:
>>
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>
>>
>> Regards
>>
>>
>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
>> wrote:
>>
>>> Hi
>>> I have to transform unstructured text to dataframe.
>>> Could anyone please help with Scala code ?
>>>
>>> I need the dataframe as:
>>>
>>> operadora filial unidade contrato empresa plano codigo_beneficiario
>>> nome_beneficiario
>>>
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>>
>>> Contrato#898011000 - FUNDACAO GERDAU
>>> Empresa#FUNDACAO GERDAU
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>>
>>>
>>


Re: Help With unstructured text file with spark scala

2022-02-09 Thread Bitfox
Hi

I am not sure about the whole situation, but if you want a Scala solution, I think
you could use a regex to match and capture the keywords.
Here is one I wrote that you can adapt on your end.

import scala.io.Source
import scala.collection.mutable.ArrayBuffer

// list1 collects lines with three "#"-separated fields, list2 lines with two
val list1 = ArrayBuffer[(String,String,String)]()
val list2 = ArrayBuffer[(String,String)]()

val patt1 = """^(.*)#(.*)#([^#]*)$""".r
val patt2 = """^(.*)#([^#]*)$""".r

val file = "1.txt"
val lines = Source.fromFile(file).getLines()

for ( x <- lines ) {
  x match {
    case patt1(k,v,z) => list1 += ((k,v,z))
    case patt2(k,v)   => list2 += ((k,v))
    case _            => println("no match")   // lines without "#" fall through here
  }
}



Now list1 and list2 hold the elements you wanted; you can convert them
to a dataframe easily.
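
For that last step, a minimal sketch (assuming this runs in spark-shell, where a SparkSession named spark and its implicits are available; the column names below are only placeholders):

import spark.implicits._

// list1 holds the three-field lines, list2 the two-field (key#value) lines
val df3 = list1.toSeq.toDF("col1", "col2", "col3")
val df2 = list2.toSeq.toDF("key", "value")

df3.show(false)
df2.show(false)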


Thanks.

On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa 
wrote:

> Hello
>
>
> Yes, this block I can open as CSV with the # delimiter, but there is a block
> that is not in CSV format.
>
> That one is more like key/value.
>
> We have two different layouts in the same file. This is the “problem”.
>
> Thanks for your time.
>
>
>
> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>
>
> On 9 Feb 2022, at 00:58, Bitfox  wrote:
>
> Hello
>
> You can treat it as a CSV file and load it from Spark:
>
> >>> df = spark.read.format("csv").option("inferSchema",
> "true").option("header", "true").option("sep","#").load(csv_file)
> >>> df.show()
> +--------------------+-------------------+-----------------+
> |               Plano|Código Beneficiário|Nome Beneficiário|
> +--------------------+-------------------+-----------------+
> |58693 - NACIONAL ...|           65751353|       Jose Silva|
> |58693 - NACIONAL ...|           65751388|      Joana Silva|
> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
> |58693 - NACIONAL ...|           65751388|      Julia Silva|
> +--------------------+-------------------+-----------------+
>
>
> cat csv_file:
>
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>
>
> Regards
>
>
> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
> wrote:
>
>> Hi
>> I have to transform unstructured text to dataframe.
>> Could anyone please help with Scala code ?
>>
>> I need the dataframe as:
>>
>> operadora filial unidade contrato empresa plano codigo_beneficiario
>> nome_beneficiario
>>
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>
>> Contrato#898011000 - FUNDACAO GERDAU
>> Empresa#FUNDACAO GERDAU
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>
>>
>


Re: Help With unstructured text file with spark scala

2022-02-09 Thread Danilo Sousa
Hello, how are you?

Thanks for your time

> Does the data contain records? 
Yes
> Are the records "homogenous" ; ie; do they have the same fields?
Yes, the data is homogeneous, but it has “two layouts” in the same file.
> What is the format of the data?
All of the data is strings, in a .txt file.
> Are records separated by lines/seperators?
Yes, the delimiter is “#”, but as said, we have two layouts in the same file.
This one is like key/value:
>Carteira em#27/12/2019##Todos os Beneficiários
>Operadora#AMIL
>Filial#SÃO PAULO#Unidade#Guarulhos
> 
>Contrato#123456 - Test
>Empresa#Test

And this one is like CSV format:

>Plano#Código Beneficiário#Nome Beneficiário
>58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>58693 - NACIONAL R COPART PJCE#073930313#Maria Silva

> Is the data sharded across multiple files?
No
> How big is each shard?
Approximately 20 GB.
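
Given the two layouts described above, here is a rough, untested Scala sketch of one way to flatten them into the requested columns (the file path "1.txt" and the exact header keys are assumptions taken from the samples in this thread): walk the file once, remember the latest header values, and emit one row per beneficiary line.

import scala.io.Source
import scala.collection.mutable.ArrayBuffer

// Accumulates one 8-field tuple per beneficiary line, carrying the latest header values.
val rows = ArrayBuffer[(String, String, String, String, String, String, String, String)]()
var operadora, filial, unidade, contrato, empresa = ""

for (line <- Source.fromFile("1.txt").getLines()) {
  val parts = line.split("#", -1)
  parts(0) match {
    case "Operadora"                   => operadora = parts(1)
    case "Filial" if parts.length >= 4 => filial = parts(1); unidade = parts(3)  // Filial#X#Unidade#Y
    case "Contrato"                    => contrato = parts(1)
    case "Empresa"                     => empresa = parts(1)
    case "Plano" | "Carteira em" | ""  => ()   // column header, report header, or blank line: skip
    case _ if parts.length == 3        =>      // plano#codigo#nome detail row
      rows += ((operadora, filial, unidade, contrato, empresa, parts(0), parts(1), parts(2)))
    case _                             => ()   // report title and anything else: ignore
  }
}

// In spark-shell, after import spark.implicits._ :
// val df = rows.toDF("operadora", "filial", "unidade", "contrato", "empresa",
//                    "plano", "codigo_beneficiario", "nome_beneficiario")

For a roughly 20 GB file this single-machine loop only illustrates the parsing logic; the same per-line state machine would have to be applied after splitting the file, or inside Spark itself, as discussed elsewhere in the thread.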

> On 8 Feb 2022, at 16:56, Lalwani, Jayesh  wrote:
> 
> You will need to provide more info.
> 
> Does the data contain records? 
> Are the records "homogenous" ; ie; do they have the same fields?
> What is the format of the data?
> Are records separated by lines/seperators?
> Is the data sharded across multiple files?
> How big is each shard?
> 
> 
> 
> On 2/8/22, 11:50 AM, "Danilo Sousa"  wrote:
> 
> 
> 
> 
>Hi
>I have to transform unstructured text to dataframe.
>Could anyone please help with Scala code ?
> 
>I need the dataframe as:
> 
>operadora filial unidade contrato empresa plano codigo_beneficiario 
> nome_beneficiario
> 
>Relação de Beneficiários Ativos e Excluídos
>Carteira em#27/12/2019##Todos os Beneficiários
>Operadora#AMIL
>Filial#SÃO PAULO#Unidade#Guarulhos
> 
>Contrato#123456 - Test
>Empresa#Test
>Plano#Código Beneficiário#Nome Beneficiário
>58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
> 
>Contrato#898011000 - FUNDACAO GERDAU
>Empresa#FUNDACAO GERDAU
>Plano#Código Beneficiário#Nome Beneficiário
>58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
> 
> 





Re: Help With unstructured text file with spark scala

2022-02-09 Thread Danilo Sousa
Hello


Yes, this block I can open as CSV with the # delimiter, but there is a block that 
is not in CSV format. 

That one is more like key/value. 

We have two different layouts in the same file. This is the “problem”.

Thanks for your time.



> Relação de Beneficiários Ativos e Excluídos
> Carteira em#27/12/2019##Todos os Beneficiários
> Operadora#AMIL
> Filial#SÃO PAULO#Unidade#Guarulhos
> 
> Contrato#123456 - Test
> Empresa#Test
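
One hedged option for keeping both layouts inside Spark (a sketch only: the path "1.txt", the variable names, and the output columns are made up for illustration) is to read the file as plain text and route each line by how many "#"-separated fields it has:

import org.apache.spark.sql.functions.{col, size, split}

val raw = spark.read.text("1.txt")      // single column named "value"
val fields = split(col("value"), "#")

// Lines with exactly three "#" fields are the beneficiary rows (minus the Plano# header line)...
val beneficiarios = raw
  .filter(size(fields) === 3 && !col("value").startsWith("Plano#"))
  .select(fields.getItem(0).as("plano"),
          fields.getItem(1).as("codigo_beneficiario"),
          fields.getItem(2).as("nome_beneficiario"))

// ...and everything else belongs to the key/value header layout.
val cabecalho = raw.filter(size(fields) =!= 3)

This only separates the two layouts; attaching the header values (Operadora, Contrato, Empresa, ...) to each beneficiary row still needs a stateful pass over the lines, as sketched elsewhere in the thread.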

> On 9 Feb 2022, at 00:58, Bitfox  wrote:
> 
> Hello
> 
> You can treat it as a CSV file and load it from Spark:
> 
> >>> df = spark.read.format("csv").option("inferSchema", 
> >>> "true").option("header", "true").option("sep","#").load(csv_file)
> >>> df.show()
> +--------------------+-------------------+-----------------+
> |               Plano|Código Beneficiário|Nome Beneficiário|
> +--------------------+-------------------+-----------------+
> |58693 - NACIONAL ...|           65751353|       Jose Silva|
> |58693 - NACIONAL ...|           65751388|      Joana Silva|
> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
> |58693 - NACIONAL ...|           65751388|      Julia Silva|
> +--------------------+-------------------+-----------------+
> 
> 
> cat csv_file:
> 
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
> 
> 
> Regards
> 
> 
> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa wrote:
> Hi
> I have to transform unstructured text to dataframe.
> Could anyone please help with Scala code ?
> 
> I need the dataframe as:
> 
> operadora filial unidade contrato empresa plano codigo_beneficiario 
> nome_beneficiario
> 
> Relação de Beneficiários Ativos e Excluídos
> Carteira em#27/12/2019##Todos os Beneficiários
> Operadora#AMIL
> Filial#SÃO PAULO#Unidade#Guarulhos
> 
> Contrato#123456 - Test
> Empresa#Test
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
> 
> Contrato#898011000 - FUNDACAO GERDAU
> Empresa#FUNDACAO GERDAU
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
> 
> 



Re: Help With unstructured text file with spark scala

2022-02-08 Thread Bitfox
Hello

You can treat it as a CSV file and load it from Spark:

>>> df = spark.read.format("csv").option("inferSchema",
"true").option("header", "true").option("sep","#").load(csv_file)

>>> df.show()

+--------------------+-------------------+-----------------+
|               Plano|Código Beneficiário|Nome Beneficiário|
+--------------------+-------------------+-----------------+
|58693 - NACIONAL ...|           65751353|       Jose Silva|
|58693 - NACIONAL ...|           65751388|      Joana Silva|
|58693 - NACIONAL ...|           65751353|     Felipe Silva|
|58693 - NACIONAL ...|           65751388|      Julia Silva|
+--------------------+-------------------+-----------------+



cat csv_file:


Plano#Código Beneficiário#Nome Beneficiário
58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
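
Since the thread is about Scala, a rough spark-shell equivalent of the PySpark call above would look like this (csv_file here is just a placeholder for the real path):

val df = spark.read
  .format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("sep", "#")
  .load("csv_file")

df.show()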



Regards



On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
wrote:

> Hi
> I have to transform unstructured text to dataframe.
> Could anyone please help with Scala code ?
>
> I need the dataframe as:
>
> operadora filial unidade contrato empresa plano codigo_beneficiario
> nome_beneficiario
>
> Relação de Beneficiários Ativos e Excluídos
> Carteira em#27/12/2019##Todos os Beneficiários
> Operadora#AMIL
> Filial#SÃO PAULO#Unidade#Guarulhos
>
> Contrato#123456 - Test
> Empresa#Test
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>
> Contrato#898011000 - FUNDACAO GERDAU
> Empresa#FUNDACAO GERDAU
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>
>


Re: Help With unstructured text file with spark scala

2022-02-08 Thread Lalwani, Jayesh
You will need to provide more info.

Does the data contain records? 
Are the records "homogenous" ; ie; do they have the same fields?
What is the format of the data?
Are records separated by lines/seperators?
Is the data sharded across multiple files?
How big is each shard?



On 2/8/22, 11:50 AM, "Danilo Sousa"  wrote:




Hi
I have to transform unstructured text to dataframe.
Could anyone please help with Scala code ?

I need the dataframe as:

operadora filial unidade contrato empresa plano codigo_beneficiario 
nome_beneficiario

Relação de Beneficiários Ativos e Excluídos
Carteira em#27/12/2019##Todos os Beneficiários
Operadora#AMIL
Filial#SÃO PAULO#Unidade#Guarulhos

Contrato#123456 - Test
Empresa#Test
Plano#Código Beneficiário#Nome Beneficiário
58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
58693 - NACIONAL R COPART PJCE#073930313#Maria Silva

Contrato#898011000 - FUNDACAO GERDAU
Empresa#FUNDACAO GERDAU
Plano#Código Beneficiário#Nome Beneficiário
58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
58693 - NACIONAL R COPART PJCE#065751388#Julia Silva