Re: Is it ok to make I/O calls in a UDF? In other words, is it a standard practice?

2018-04-23 Thread Jörn Franke
What is your use case? > On 23. Apr 2018, at 23:27, kant kodali wrote: > Hi All, is it ok to make I/O calls in a UDF? In other words, is it a standard practice? Thanks!

Spark+AI Summit 2018 (promo code within)

2018-04-23 Thread Scott walent
Spark+AI Summit is only 6 weeks away. Keynotes this year include talks from Tesla, Apple, Databricks, Andreessen Horowitz and many more! Use code "SparkList" and save 15% when registering at http://databricks.com/sparkaisummit We hope to see you there. -Scott

Re: Is it ok to make I/O calls in a UDF? In other words, is it a standard practice?

2018-04-23 Thread kant kodali
Yes, for sure it works. I am just not sure if it is a good idea when getting a stream of messages from Kafka. On Mon, Apr 23, 2018 at 2:54 PM, Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> wrote: > I have made a simple REST call within a UDF and it worked, but not sure if it can be applied

Re: Is it ok to make I/O calls in a UDF? In other words, is it a standard practice?

2018-04-23 Thread Sathish Kumaran Vairavelu
I have made a simple REST call within a UDF and it worked, but I am not sure if it can be applied to large datasets; maybe for small lookup files. Thanks. On Mon, Apr 23, 2018 at 4:28 PM kant kodali wrote: > Hi All, is it ok to make I/O calls in a UDF? In other words, is it a
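A minimal Scala sketch of the pattern Sathish describes, calling a REST endpoint from inside a UDF; the lookup URL below is hypothetical, and note that this issues one HTTP call per row, so it is only reasonable for small lookups or heavily cached services:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object UdfLookupSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-io-sketch").getOrCreate()
        import spark.implicits._

        // Hypothetical lookup service; replace with your own endpoint.
        val restLookup = udf { (id: String) =>
          try {
            scala.io.Source.fromURL(s"http://lookup-service.example.com/item/$id").mkString
          } catch {
            // Swallow failures so one bad call does not kill the whole task.
            case _: java.io.IOException => null
          }
        }

        val df = Seq("a1", "a2").toDF("id")
        df.withColumn("lookup", restLookup($"id")).show(false)
      }
    }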

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Guys, here is the illustration: https://github.com/parisni/SparkPdfExtractor. Please add issues for any questions or improvement ideas. Enjoy. Cheers. 2018-04-23 20:42 GMT+02:00 unk1102: > Thanks much Nicolas, really appreciate it.

Re: Spark dataset to byte array over grpc

2018-04-23 Thread Bryan Cutler
Hi Ashwin, this sounds like it might be a good use for Apache Arrow, if you are flexible about the exchange format. As of Spark 2.3, Dataset has a method "toArrowPayload" that will convert a Dataset of Rows to a byte array in Arrow format, although the API is currently not public. Your
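Until an Arrow-based API is public, one simple fallback that uses only public APIs is to serialize the collected rows to JSON on the driver and take those bytes; a minimal sketch, assuming the result is small enough to collect (the table name is a placeholder):

    import java.nio.charset.StandardCharsets

    import org.apache.spark.sql.SparkSession

    object ResultBytesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("result-bytes-sketch").getOrCreate()

        // Hypothetical query; substitute your own SQL.
        val result = spark.sql("SELECT * FROM some_table LIMIT 1000")

        // Collect as JSON lines and turn them into the raw bytes the gRPC
        // "bytes" field expects. Only safe when the result fits in driver memory.
        val payload: Array[Byte] =
          result.toJSON.collect().mkString("\n").getBytes(StandardCharsets.UTF_8)

        println(s"payload size: ${payload.length} bytes")
        spark.stop()
      }
    }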

Spark dataset to byte array over grpc

2018-04-23 Thread Ashwin Sai Shankar
Hi! I'm building a Spark app which runs a Spark SQL query and sends the results to a client over gRPC (my proto file is configured to send the SQL output as "bytes"). The client then displays the output rows. When I run spark.sql, I get a Dataset. How do I convert this to a byte array? Also, is there a

schema change for structured spark streaming using jsonl files

2018-04-23 Thread Lian Jiang
Hi, I am using structured Spark streaming, which reads JSONL files and writes Parquet files. I am wondering what the process is if the JSONL files' schema changes. Suppose the JSONL files are generated in the \jsonl folder and the old schema is { "field1": String }. My proposal is: 1. write the jsonl files
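For context, a minimal Scala sketch of the kind of job described above, reading JSONL with the old explicit schema and writing Parquet (paths are placeholders); because file-based streaming sources require a user-supplied schema, a schema change means restarting the query with a new schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object JsonlToParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("jsonl-to-parquet-sketch").getOrCreate()

        // The "old" schema from the question: a single string field.
        val oldSchema = StructType(Seq(StructField("field1", StringType)))

        val stream = spark.readStream
          .schema(oldSchema)
          .json("/jsonl")    // input folder (placeholder path)

        val query = stream.writeStream
          .format("parquet")
          .option("path", "/parquet/out")                 // placeholder output
          .option("checkpointLocation", "/parquet/_checkpoints")
          .start()

        query.awaitTermination()
      }
    }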

Unsubscribe

2018-04-23 Thread Shahab Yunus
Unsubscribe

Unsubscribe

2018-04-23 Thread varma dantuluri
Unsubscribe -- Regards, Varma Dantuluri

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Thanks much Nicolas, really appreciate it.

Re: [Structured Streaming] Restarting streaming query on exception/termination

2018-04-23 Thread Priyank Shrivastava
Thanks for the reply, formice. I think the --supervise param helps restart the whole Spark application; what I want to be able to do is restart only the structured streaming query that terminated due to an error. Also, I am running my app in client mode. Thanks, Priyank On Sun, Apr 22,
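One possible approach (a sketch only, not necessarily what this thread settled on) is to wrap the query in a driver-side retry loop, so that only the streaming query is restarted on failure while the application keeps running; the rate/console source and sink below are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryException

    object RestartQuerySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("restart-query-sketch").getOrCreate()

        // Placeholder pipeline; substitute the real source, transformations and sink.
        def startQuery() = spark.readStream
          .format("rate").load()
          .writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/restart-sketch-checkpoint")
          .start()

        var keepRunning = true
        while (keepRunning) {
          val query = startQuery()
          try {
            // Returns normally only if the query is stopped deliberately.
            query.awaitTermination()
            keepRunning = false
          } catch {
            case e: StreamingQueryException =>
              // Log and restart just this query; the SparkSession keeps running.
              println(s"Query failed, restarting: ${e.getMessage}")
          }
        }
        spark.stop()
      }
    }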

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Sure, then let me recap the steps: 1. load the PDFs from a local folder into HDFS Avro; 2. load the Avro in Spark as an RDD; 3. apply PDFBox to each PDF and return the content as a string; 4. write the result as one huge CSV file. That's some work, guys, for me to push all that. I should find some time within 7 days, however
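A minimal Scala sketch of steps 2-4, assuming an Avro table with a "path" name column and a binary "content" column (those column names are assumptions) and that the spark-avro data source and PDFBox are on the classpath:

    import java.io.ByteArrayInputStream

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper
    import org.apache.spark.sql.SparkSession

    object PdfAvroToTextSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pdf-avro-to-text-sketch").getOrCreate()
        import spark.implicits._

        // Step 2: read the Avro table (spark-avro package for Spark 2.3).
        val pdfs = spark.read
          .format("com.databricks.spark.avro")
          .load("/data/pdfs.avro")
          .select($"path", $"content")

        // Step 3: run PDFBox over each binary PDF and return its text.
        val texts = pdfs.map { row =>
          val path = row.getString(0)
          val bytes = row.getAs[Array[Byte]](1)
          val doc = PDDocument.load(new ByteArrayInputStream(bytes))
          try {
            (path, new PDFTextStripper().getText(doc).replaceAll("\\s+", " "))
          } finally {
            doc.close()
          }
        }.toDF("path", "text")

        // Step 4: write the result as CSV.
        texts.write.option("header", "true").csv("/data/pdf_text_csv")
      }
    }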

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Yes Nicolas. It would be a great help if you can push the code to GitHub and share the URL. Thanks, Deepak On Mon, Apr 23, 2018, 23:00 unk1102 wrote: > Hi Nicolas, thanks much for the guidance, it was very useful information. If you can push that code to GitHub and share the URL it would

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the guidance, it was very useful information. If you can push that code to GitHub and share the URL it would be a great help. Looking forward. If you can find time to push it early it would be an even greater help, as I have to finish a POC on this use case ASAP.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
2018-04-23 18:59 GMT+02:00 unk1102: > Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere? I have some open-source code. I could find time to push it on GitHub if needed. > Do you just keep the pdf in avro binary all the time? Yes, I store

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere? Do you just keep the PDFs in Avro binary all the time? How often do you parse them into text using PDFBox? Is it on an on-demand basis, or do you always parse to text and keep the PDF binary in Avro just as an interim state?

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Is there any open-source code base to refer to for this kind of use case? Thanks, Deepak On Mon, Apr 23, 2018, 22:13 Nicolas Paris wrote: > Hi. The problem is the number of files on Hadoop; I deal with 50M pdf files. What I did is to put them in an avro table on hdfs,

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Hi. The problem is the number of files on Hadoop; I deal with 50M PDF files. What I did is to put them in an Avro table on HDFS, as a binary column. Then I read it with Spark and push that into PDFBox. Transforming 50M PDFs into text took 2 hours on a 5-computer cluster. About colors and formatting, I

Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi, I need guidance on dealing with a large number of PDF files when using Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and then convert them to text using PDF parsers like Apache Tika or PDFBox, etc., or should I convert them into text using these parsers and store them as text files? But in doing
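For reference, a minimal Scala sketch of the sc.binaryFiles route mentioned here, using PDFBox (paths are placeholders); as the replies above point out, the catch with this approach is the sheer number of small files on HDFS:

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper
    import org.apache.spark.sql.SparkSession

    object BinaryFilesPdfSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("binary-files-pdf-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Each element is (file path, PortableDataStream); the bytes are only
        // read when the task runs.
        val pdfText = sc.binaryFiles("hdfs:///data/pdfs/*.pdf")
          .mapValues { stream =>
            val doc = PDDocument.load(stream.toArray())  // whole PDF in memory per record
            try new PDFTextStripper().getText(doc)
            finally doc.close()
          }

        pdfText.saveAsTextFile("hdfs:///data/pdfs_text")
        spark.stop()
      }
    }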

Error while processing statement: hive configuration hive.query.name does not exists.

2018-04-23 Thread Saran Pal
Hello All, while loading any data into a Hive table using Talend, the following error comes up. Please help. Error while processing statement: hive configuration hive.query.name does not exists. Kind Regards, Saran Pal +91 981-888-0977

[How To] Using Spark Session in internal called classes

2018-04-23 Thread Aakash Basu
Hi, I have created my own Model Tuner class which I want to use to tune models and return a Model object if the user expects one. This Model Tuner is in a file which I would ideally import into another file, then call the class and use it. Outer file (from where I'd be calling the Model Tuner): I am
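A common pattern (sketched below in Scala; the class and method names are made up, not the poster's actual code) is to pass the SparkSession into the helper class explicitly, or to call SparkSession.builder().getOrCreate() inside it, which returns the session that is already running; the same idea applies in PySpark:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical helper, standing in for the poster's Model Tuner.
    class ModelTuner(spark: SparkSession) {
      def tune(trainingPath: String): DataFrame = {
        // Use the session that was handed in rather than creating a new one.
        spark.read.parquet(trainingPath)
        // ... fit/evaluate models here and return whatever summary is needed ...
      }
    }

    object OuterJob {
      def main(args: Array[String]): Unit = {
        // getOrCreate() returns the already-running session if one exists,
        // so helper code never ends up with a second, conflicting session.
        val spark = SparkSession.builder().appName("outer-job").getOrCreate()

        val tuner = new ModelTuner(spark)
        val summary = tuner.tune("/data/training.parquet")
        summary.show()

        spark.stop()
      }
    }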

flatMapGroupsWithState equivalent in PySpark

2018-04-23 Thread ZmeiGorynych
I need to write PySpark logic equivalent to what flatMapGroupsWithState does in Scala/Java. To be precise, I need to take an incoming stream of records, group them by an arbitrary attribute, and feed each group a record at a time to a separate instance of a user-defined (so 'black-box') Python
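There is no direct PySpark counterpart to this API in Spark 2.3. For reference, a minimal Scala sketch of the flatMapGroupsWithState pattern the poster wants to mirror; the Event/RunningTotal types and the rate source are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    object FlatMapGroupsWithStateSketch {
      // Made-up record and state types, just to show the shape of the API.
      case class Event(key: String, value: Long)
      case class RunningTotal(sum: Long)

      // Called per key and micro-batch with that key's new records and its state.
      def updateTotals(
          key: String,
          rows: Iterator[Event],
          state: GroupState[RunningTotal]): Iterator[(String, Long)] = {
        val previous = state.getOption.getOrElse(RunningTotal(0L))
        val updated = RunningTotal(previous.sum + rows.map(_.value).sum)
        state.update(updated)
        Iterator((key, updated.sum))
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("fmgws-sketch").getOrCreate()
        import spark.implicits._

        val events = spark.readStream
          .format("rate").load()
          .select(($"value" % 10).cast("string").as("key"), $"value")
          .as[Event]

        val totals = events
          .groupByKey(_.key)
          .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.NoTimeout)(updateTotals)

        totals.writeStream.format("console").outputMode("append").start().awaitTermination()
      }
    }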

Getting Corrupt Records while loading data into dataframe from csv file

2018-04-23 Thread Shuporno Choudhury
Hi all, I have a manually created schema with which I am loading data from multiple CSV files into a dataframe. Now, if there are certain records that fail the provided schema, is there a way to get those rejected records and continue with the process of loading data into the dataframe? As of now,
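One way to do this (a sketch; the schema, paths and column names are placeholders) is to read in PERMISSIVE mode with a corrupt-record column, so malformed rows are kept in a separate column instead of failing the load, and then split good rows from rejected ones:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object CorruptCsvSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("corrupt-csv-sketch").getOrCreate()

        // Manually created schema (placeholder fields), plus an extra string
        // column that PERMISSIVE mode fills with the raw line of any record
        // that does not fit the schema.
        val schema = StructType(Seq(
          StructField("id", IntegerType),
          StructField("name", StringType),
          StructField("_corrupt_record", StringType)))

        val df = spark.read
          .schema(schema)
          .option("header", "true")
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .csv("/data/input/*.csv")
          .cache()  // cache before querying only the corrupt column (Spark 2.3 restriction)

        val good = df.filter(col("_corrupt_record").isNull).drop("_corrupt_record")
        val rejected = df.filter(col("_corrupt_record").isNotNull).select("_corrupt_record")

        good.write.mode("overwrite").parquet("/data/output/good")
        rejected.write.mode("overwrite").text("/data/output/rejected")
      }
    }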