Re: [akka-user] Parallel File Processing with Akka Actors?

Michael Frank Fri, 08 May 2015 16:59:51 -0700

what is the result of the log processing of a single file? is it some aggregation or summary, or are you performing some action for each log line?

it seems to me the most performant solution would be to not use actors at all, but to create a dedicated dispatcher and process each log file in a Future. in this way you maximize your caching (data/instruction/readahead) and minimize your context switching. you also don't have to worry about the fact that you are using synchronous I/O. if you are summarizing/aggregating the log file, then the result of the Future is your summary, and you can pipe that result to an actor using pipeTo().
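A minimal sketch of that idea in plain Java, with no Akka dependency: a fixed thread pool stands in for the dedicated dispatcher, a CompletableFuture stands in for the Future, and thenAccept marks the spot where an Akka application would pipe the result to an actor. The class name, the Summary type, and the "count lines containing ERROR" rule are all made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

final class LogFileFutures {

    /** Hypothetical per-file result: total lines and lines containing "ERROR". */
    static final class Summary {
        final long lines;
        final long errors;

        Summary(long lines, long errors) {
            this.lines = lines;
            this.errors = errors;
        }
    }

    /** Pure aggregation over lines, kept separate from I/O so it is easy to test. */
    static Summary summarizeLines(Iterable<String> lines) {
        long total = 0;
        long errors = 0;
        for (String line : lines) {
            total++;
            if (line.contains("ERROR")) {
                errors++;
            }
        }
        return new Summary(total, errors);
    }

    /** Reads one file front to back with blocking I/O on the calling thread. */
    static Summary summarizeFile(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return summarizeLines(lines::iterator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("app", ".log");
        Files.write(file, Arrays.asList("INFO start", "ERROR boom", "INFO done"));

        // Stand-in for the dedicated dispatcher: a pool used only for file processing.
        ExecutorService fileReaders =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            CompletableFuture<Summary> future =
                    CompletableFuture.supplyAsync(() -> summarizeFile(file), fileReaders);
            // Stand-in for pipeTo(): hand the finished summary to whoever consumes it.
            future.thenAccept(s ->
                    System.out.println(s.lines + " lines, " + s.errors + " errors")).join();
        } finally {
            fileReaders.shutdown();
            Files.deleteIfExists(file);
        }
    }
}
```

Each file is read by exactly one thread from start to end, so the blocking reads stay sequential and never tie up the threads that run actors.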

this is optimizing for throughput however, not latency. in order to balance throughput vs. latency, you might consider a bimodal approach, where files larger than a certain threshold get processed using a synchronous approach with Futures, and small files are processed in an actor. you could abstract the processing into a trait and share that trait between both approaches.
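Roughly, the bimodal routing could look like the sketch below, again in plain Java rather than Akka. A Java interface plays the role of the shared trait, running inline on the caller's thread stands in for "processed in an actor", and the 64 MB threshold is an arbitrary assumption that would need benchmarking. All names here are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BimodalDispatch {

    /** Assumed cut-off between "small" and "large"; pick a real value by measuring. */
    static final long LARGE_FILE_BYTES = 64L * 1024 * 1024;

    enum Route { INLINE, DEDICATED_POOL }

    /** Decides which path a file of the given size takes. */
    static Route routeFor(long sizeBytes) {
        return sizeBytes > LARGE_FILE_BYTES ? Route.DEDICATED_POOL : Route.INLINE;
    }

    /** Shared processing logic, playing the role of the trait used by both paths. */
    interface LogProcessor<R> {
        R process(String fileName);
    }

    /** Large files go to the dedicated pool; small files run on the caller's thread. */
    static <R> CompletableFuture<R> submit(String fileName, long sizeBytes,
                                           LogProcessor<R> processor, Executor largeFilePool) {
        if (routeFor(sizeBytes) == Route.DEDICATED_POOL) {
            return CompletableFuture.supplyAsync(() -> processor.process(fileName), largeFilePool);
        }
        return CompletableFuture.completedFuture(processor.process(fileName));
    }

    public static void main(String[] args) {
        ExecutorService largeFilePool = Executors.newFixedThreadPool(2);
        try {
            // Placeholder processor: a real one would open and parse the file.
            LogProcessor<String> processor = name -> "processed " + name;
            System.out.println(submit("small.log", 2L * 1024 * 1024, processor, largeFilePool).join());
            System.out.println(submit("huge.log", 2L * 1024 * 1024 * 1024, processor, largeFilePool).join());
        } finally {
            largeFilePool.shutdown();
        }
    }
}
```

Both paths call the same LogProcessor, so the parsing logic is written once and only the scheduling differs.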


that's just my 2 cents, however, without the benefit of much context.

-Michael

On 05/08/15 15:06, Harit Himanshu wrote:

Hi Idar

I just confirmed with some of our teammates that it depends upon our customers.


 1. Some customers use local disk and remove logs after processing.
    Other customers use NAS-based storage. None use SSDs, as far as
    I understand.
 2. The logs differ a lot in size. Depending on the log rolling
    rules, they range from a few megabytes to a few gigabytes.
 3. There is not much processing. For each log line we either ignore
    it (based on certain conditions), or encrypt certain data and
    build an output format (usually JSON).

Do you have a better idea, or would your recommendation differ based on this information?


Thank you
+ Harit Himanshu


On Thursday, May 7, 2015 at 11:44:04 PM UTC-7, Idar Borlaug wrote:

    What filesystem and disks are you reading the files from? Reading
    a file in one actor is a good idea, because you can read it
    sequentially. Reading from 10 different places in the same file
    can be a lot slower or faster, depending on the storage. MPI-IO,
    which is used in computational clusters, has methods for splitting
    a file and reading one part each on different nodes.

    How much processing is there for each line?

    I would implement both alternatives and do some benchmarking.
    Maybe a third option would be to read the files in each
    LogLineProcessActor and ditch the FileActor.

    What would also be cool is to use async I/O for reading the
    files. I have no experience with that.

    On Fri, May 8, 2015 at 2:23 AM Harit Himanshu
    <[email protected]> wrote:

        Hello

        This is what my use case looks like

        *Use Case*

        - Given many log files in the range 2MB - 2GB, I need to parse
        each of these logs, apply some processing, and generate Java
        |POJO|s.
        - For this problem, let's assume that we have just |1| log file.
        - Also, the idea is to make the best use of the system. Multiple
        cores are available.

        *Alternative 1*
        - Open file (synchronous), read each line, generate |POJO|s

        |FileActor  ->  read each line->  List<POJO>   |

        */Pros/*: simple to understand
        */Cons/*: Serial Process, not taking advantage of multiple
        cores in the system

        *Alternative 2*
        - Open File (synchronous), read |N| lines (|N| is
        configurable), pass on to different actors to process

        |                                                       /  LogLineProcessActor  1
        FileActor  ->  LogLineProcessRouter  (with 10 Actors)  --  LogLineProcessActor  2
                                                                \  LogLineProcessActor  10|

        */Pros/* Some parallelization, by using different actors to
        process part of lines. Actors will make use of available cores
        in the system (? how, may be?)
        */Cons/* Still Serial, because file read in serial fashion
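A rough sketch of Alternative 2 without Akka, to make the shape concrete: one sequential pass cuts the lines into batches of N, and a fixed pool of 10 worker threads stands in for the router and its LogLineProcessActors. The trim-and-drop-blank-lines step is only a placeholder for the real per-line work, and every name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BatchedLineProcessing {

    /** Splits the sequentially read lines into batches of at most n lines (n >= 1). */
    static List<List<String>> batches(Iterable<String> lines, int n) {
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>(n);
        for (String line : lines) {
            current.add(line);
            if (current.size() == n) {
                out.add(current);
                current = new ArrayList<>(n);
            }
        }
        if (!current.isEmpty()) {
            out.add(current);
        }
        return out;
    }

    /** Placeholder per-line work: trim each line and drop the blank ones. */
    static List<String> processBatch(List<String> batch) {
        List<String> result = new ArrayList<>();
        for (String line : batch) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Submits every batch to the worker pool and collects results in input order. */
    static List<String> processAll(Iterable<String> lines, int n, ExecutorService workers) {
        List<CompletableFuture<List<String>>> pending = new ArrayList<>();
        for (List<String> batch : batches(lines, n)) {
            pending.add(CompletableFuture.supplyAsync(() -> processBatch(batch), workers));
        }
        List<String> all = new ArrayList<>();
        for (CompletableFuture<List<String>> future : pending) {
            all.addAll(future.join());
        }
        return all;
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(10);
        try {
            System.out.println(processAll(Arrays.asList(" a ", "", "b", "c "), 2, workers));
        } finally {
            workers.shutdown();
        }
    }
}
```

This version holds all batches in memory before submitting them, which is fine for a demo but not for a 2GB file; a real reader would need to limit how many batches are in flight at once.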

        *Questions*
        - Is either of the above a good choice?
        - Are there better alternatives?

        Please provide valuable thoughts here

        Thanks a lot

-->>>>>>>>>> Read the docs: http://akka.io/docs/

        >>>>>>>>>> Check the FAQ:
        http://doc.akka.io/docs/akka/current/additional/faq.html
        <http://doc.akka.io/docs/akka/current/additional/faq.html>
        >>>>>>>>>> Search the archives:
        https://groups.google.com/group/akka-user
        <https://groups.google.com/group/akka-user>
        ---
        You received this message because you are subscribed to the
        Google Groups "Akka User List" group.
        To unsubscribe from this group and stop receiving emails from
        it, send an email to [email protected].
        To post to this group, send email to [email protected].
        Visit this group at http://groups.google.com/group/akka-user
        <http://groups.google.com/group/akka-user>.
        For more options, visit https://groups.google.com/d/optout
        <https://groups.google.com/d/optout>.

