Yes, this can definitely be added. I just haven't gotten around to doing it :) There is a proposal for this that you can try - https://github.com/apache/spark/pull/2765/files . Have you reviewed it at some point?
On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter <adamge...@gmail.com> wrote:

> That doesn't seem like a good solution unfortunately, as I would need this
> to work in a production environment. Do you know why the limitation exists
> for FileInputDStream in the first place? Unless I'm missing something
> important about how some of the internals work, I don't see why this
> feature couldn't be added at some point.
>
> On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> A sort-of-hacky workaround is to use a queueStream, where you can
>> manually create RDDs (using sparkContext.hadoopFile) and insert them into
>> the queue. Note that this is for testing only, as queueStream does not
>> work with driver fault recovery.
>>
>> TD
>>
>> On Fri, Apr 3, 2015 at 12:23 PM, adamgerst <adamge...@gmail.com> wrote:
>>
>>> So after pulling my hair out for a bit trying to convert one of my
>>> standard Spark jobs to streaming, I found that FileInputDStream does
>>> not support nested folders (see the brief mention here:
>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
>>> the fileStream method returns a FileInputDStream). Before, for my
>>> standard job, I was reading from, say,
>>>
>>> s3n://mybucket/2015/03/02/*log
>>>
>>> and could also modify it to get an entire month's worth of logs. Since
>>> the logs are split up by date, when the batch ran for the day I simply
>>> passed in the date as a parameter to make sure I was reading the
>>> correct data.
>>>
>>> But since I want to turn this job into a streaming job, I need to do
>>> something like
>>>
>>> s3n://mybucket/*log
>>>
>>> This would work fine if it were a standard Spark application, but it
>>> fails for streaming. Is there any way I can get around this limitation?
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-FileStream-Nested-File-Support-tp22370.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.