Here's a wild guess: because your first command uses tail -f, it never
closes the input file handle when it hits the end of the available bytes,
while your second use of nc does. If that's the case, the last few lines
may be stuck in a buffer waiting to be forwarded, and Spark would never see
those bytes.

You could test this by putting nc or another program on the other end of
the socket and checking whether it receives all the bytes.
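
For example, something along these lines, assuming everything runs on the
same host (received.log is just a placeholder name):

# terminal 1: the sender, as in your first case
tail -f <logfile> | nc -lk 9999

# terminal 2: a plain client standing in for Spark, capturing what arrives
nc localhost 9999 > received.log

# once the sender is done, compare what was sent with what arrived
wc -l <logfile> received.log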

What happens if you add the -q 10 option to your nc command in the first
case? That is, force it to close when no more bytes are seen for 10 seconds?
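
I.e., something like:

tail -f <logfile> | nc -q 10 -lk 9999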

HTH,
dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Feb 10, 2016 at 3:51 PM, Nipun Arora <nipunarora2...@gmail.com>
wrote:

> Hi All,
>
> I apologize for reposting, but I wonder if anyone can explain this
> behavior, and what the best way would be to resolve it without introducing
> something like kafka in between.
> I basically have a logstash instance, and would like to stream the output
> of logstash to spark streaming without introducing a new message-passing
> service like kafka/redis in the middle.
>
> We will probably use kafka eventually, but for now I need guaranteed
> delivery.
>
> For the tail -f <logfile> | nc -lk 9999 command, I wait for a significant
> time after spark stops receiving any data in its micro-batches. By printing
> the first two lines of every micro-batch, I confirm that it's not getting
> any data, i.e. the end of the file has probably been reached.
>
> Thanks
> Nipun
>
>
>
> On Mon, Feb 8, 2016 at 10:05 PM Nipun Arora <nipunarora2...@gmail.com>
> wrote:
>
>> I have a spark-streaming service where I am processing and detecting
>> anomalies on the basis of an offline-generated model. I feed data into
>> this service from a log file, which is streamed using the following command:
>>
>> tail -f <logfile> | nc -lk 9999
>>
>> Here the spark streaming service is taking data from port 9999. Once
>> spark has finished processing and is showing that it is processing empty
>> micro-batches, I kill both spark and the netcat process above. However, in
>> some cases I observe that the last few lines are dropped, i.e. spark
>> streaming does not receive those log lines, or they are not processed.
>>
>> However, I also observed that if I simply take the logfile as standard
>> input instead of tailing it, the connection is closed at the end of the
>> file, and no lines are dropped:
>>
>> nc -q 10 -lk 9999 < logfile
>>
>> Can anyone explain why this behavior is happening? And what would be a
>> better way of streaming log data to a spark streaming instance?
>>
>>
>> Thanks
>>
>> Nipun
>>
>
