A somewhat hacky workaround is to use a queueStream, where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.
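
A rough sketch of that approach in Scala is below; the batch interval, the
S3 glob, and the use of sc.textFile (instead of sparkContext.hadoopFile) are
placeholders for illustration, not something from your actual setup:

import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NestedDirQueueStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nested-dir-queue-stream")
    val ssc = new StreamingContext(conf, Seconds(60))
    val sc = ssc.sparkContext

    // Queue backing the DStream; each RDD pushed here becomes one batch.
    val rddQueue = new Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue, oneAtATime = true)

    lines.count().print()

    ssc.start()

    // Build RDDs over the nested folders by hand and feed them in. The glob
    // is resolved by a plain RDD read here, so nested directories are fine;
    // FileInputDStream's directory monitoring is never involved.
    rddQueue.synchronized {
      rddQueue += sc.textFile("s3n://mybucket/*/*/*/*log")
    }

    ssc.awaitTermination()
  }
}

You would typically push a new RDD into the queue on some schedule (say, a
timer that picks up the next date's folder), but again, this is only suitable
for testing since none of it survives a driver restart.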

TD

On Fri, Apr 3, 2015 at 12:23 PM, adamgerst <adamge...@gmail.com> wrote:

> So after pulling my hair out for a bit trying to convert one of my standard
> Spark jobs to streaming, I found that FileInputDStream does not support
> nested folders (see the brief mention here
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
> the fileStream method returns a FileInputDStream).  So before, for my
> standard job, I was reading from, say,
>
> s3n://mybucket/2015/03/02/*log
>
> And could also modify it to simply get an entire month's worth of logs.
> Since the logs are split up based upon their date, when the batch ran for
> the day, I simply passed in the date as a parameter to make sure I was
> reading the correct data.
>
> But since I want to turn this job into a streaming job, I need to simply do
> something like
>
> s3n://mybucket/*log
>
> This would totally work fine if it were a standard Spark application, but
> fails for streaming. Is there any way I can get around this limitation?
>
