In short,

= Current =
BSP core: Input partitioning + converting to VertexWritable
Graph module: Reads only VertexWritable

= Future =
BSP core: Input partitioning
Graph module: Reads its partition and parses the Vertex structure
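In code, the "Future" flow looks roughly like the sketch below. This is a
minimal illustration only: Vertex here is a toy stand-in, not Hama's actual
org.apache.hama.graph.Vertex, and the tab-separated layout just mirrors the
'a\tb\tc' example used later in this thread.

    import java.util.Arrays;
    import java.util.List;

    public class LoadVerticesSketch {

      // Toy stand-in for a vertex: an ID plus outgoing edge targets.
      static class Vertex {
        final String id;
        final List<String> edges;
        Vertex(String id, List<String> edges) { this.id = id; this.edges = edges; }
      }

      // In the "Future" flow, loadVertices() gets the raw record (e.g.
      // "a\tb\tc") from its assigned partition and parses it here, instead
      // of reading a pre-converted VertexWritable.
      static Vertex parseVertex(String rawRecord) {
        String[] tokens = rawRecord.split("\t");
        return new Vertex(tokens[0], Arrays.asList(tokens).subList(1, tokens.length));
      }

      public static void main(String[] args) {
        Vertex v = parseVertex("a\tb\tc");
        System.out.println(v.id + " -> " + v.edges); // prints: a -> [b, c]
      }
    }

The point is that parsing moves to the reading side, so the partition files
can keep whatever format the input arrived in.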
On Tue, May 7, 2013 at 3:53 AM, Edward J. Yoon <[email protected]> wrote:
> I think there was a misunderstanding of the terms 'remove' and
> 'record converter'.
>
> PartitioningRunner converts records; I call it a 'record converter'.
> But there's no need to write converted records in PartitioningRunner.
> The Partitioner is just a partitioner in the BSP core module.
>
> On Tue, May 7, 2013 at 3:46 AM, Edward J. Yoon <[email protected]> wrote:
>> Currently, the PartitioningRunner writes converted records to the
>> partition file, and GraphJobRunner then reads VertexWritable/NullWritable
>> K/V records. In other words:
>>
>> 1) input record: 'a\tb\tc' // assume that the input is Text
>> 2) partition files: a sequence of VertexWritable objects
>> 3) GraphJobRunner.loadVertices() reads the sequence-format partition file.
>>
>> My suggestion is to just write raw records to the partition file in
>> PartitioningRunner:
>>
>> 1) input record: 'a\tb\tc' // assume that the input is Text
>> 2) partition files: 'a\tb\tc' // data is shuffled by partition ID, but
>>    the format stays the same as the original.
>> 3) GraphJobRunner.loadVertices() reads the records from its assigned
>>    partition and parses the Vertex structure.
>>
>> Only a few lines will change.
>>
>> Why? As I described in the Wiki, in the NoSQL table-input case (which
>> supports range or random access by sorted key), there's no need for
>> re-partitioning, because the data is already range-partitioned. That
>> means the vertex structure has to be parsed in GraphJobRunner.
>>
>> With or without Suraj's suggestion, parsing of the vertex structure
>> should be done in the GraphJobRunner.loadVertices() method to prepare
>> for the NoSQL input formats.
>>
>> Does that make sense?
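A minimal sketch of step 2) above, assuming a hash-style assignment of
records to partitions; the file layout and helper names are illustrative,
not PartitioningRunner's actual code.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class RawPartitionSketch {

      // Hash-partitioner-style assignment: partition ID from the record key.
      static int partitionId(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }

      public static void main(String[] args) throws IOException {
        String[] records = { "a\tb\tc", "b\tc", "c\ta" };
        int numPartitions = 2;
        PrintWriter[] writers = new PrintWriter[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
          writers[i] = new PrintWriter(new FileWriter("part-" + i));
        }
        for (String record : records) {
          String vertexId = record.split("\t", 2)[0];
          // The record is written verbatim (same format as the input), so
          // vertex parsing is deferred to the reader, i.e. loadVertices().
          writers[partitionId(vertexId, numPartitions)].println(record);
        }
        for (PrintWriter w : writers) {
          w.close();
        }
      }
    }

Since records are written verbatim, the same reader works whether the data
came from a partitioning pass or was already partitioned at the source.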
>> On Tue, May 7, 2013 at 2:55 AM, Tommaso Teofili <[email protected]> wrote:
>>> 2013/5/6 Edward J. Yoon <[email protected]>
>>>
>>>> > - Instead of running a separate job, we inject a partitioning
>>>> > superstep before the first superstep of the job. (This has a
>>>> > dependency on the Superstep API.)
>>>> > - The partitions, instead of being written to HDFS (which creates a
>>>> > copy of the input files in the HDFS cluster; too costly, I believe),
>>>> > should be written to local files and read from there.
>>>> > - For graph jobs, we can configure this partitioning superstep class
>>>> > to a graph-specific partitioning class that partitions and loads
>>>> > vertices.
>>>>
>>>> I believe the above suggestion can be a future improvement task.
>>>>
>>>> > This sure has some dependencies, but it would be a graceful solution
>>>> > and can tackle every problem. This is what I want to achieve in the
>>>> > end. Please proceed if you have any intermediate ways to get there
>>>> > faster.
>>>>
>>>> If you understand my plan now, please let me know so that I can start
>>>> the work. My patch will change only a few lines.
>>>
>>> While it's clear to me what Suraj's proposal is, I'm not completely sure
>>> what your final proposal would be; could you explain it in more detail
>>> (or otherwise perhaps a patch to review is enough)?
>>>
>>>> Finally, I think now we can prepare the integration with the NoSQL
>>>> table input formats.
>>>
>>> As I said, I'd like to have a broad consensus before doing any
>>> significant change to core stuff.
>>>
>>> thanks,
>>> Tommaso
>>>
>>> p.s.:
>>> probably worth a different thread: what's the NoSQL usage scenario with
>>> regard to Hama?
>>>
>>>> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <[email protected]> wrote:
>>>> > I am assuming that the storage of vertices (NoSQL or any other
>>>> > format) need not be updated after every iteration.
>>>> >
>>>> > Based on the above assumption, I have the following suggestions:
>>>> >
>>>> > - Instead of running a separate job, we inject a partitioning
>>>> > superstep before the first superstep of the job. (This has a
>>>> > dependency on the Superstep API.)
>>>> > - The partitions, instead of being written to HDFS (which creates a
>>>> > copy of the input files in the HDFS cluster; too costly, I believe),
>>>> > should be written to local files and read from there.
>>>> > - For graph jobs, we can configure this partitioning superstep class
>>>> > to a graph-specific partitioning class that partitions and loads
>>>> > vertices.
>>>> >
>>>> > This sure has some dependencies, but it would be a graceful solution
>>>> > and can tackle every problem. This is what I want to achieve in the
>>>> > end. Please proceed if you have any intermediate ways to get there
>>>> > faster.
>>>> >
>>>> > Regards,
>>>> > Suraj
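A rough sketch of the injected-superstep idea above, assuming a hypothetical
Superstep interface; the real Superstep API was still being designed at this
point, so nothing here is actual Hama code.

    import java.util.ArrayList;
    import java.util.List;

    public class InjectedSuperstepSketch {

      // Hypothetical stand-in for the planned Superstep API.
      interface Superstep {
        void compute();
      }

      // Injected step: shuffle raw input records to the peers that own
      // their partitions and spill them to local files (not HDFS).
      static class PartitioningSuperstep implements Superstep {
        public void compute() {
          System.out.println("superstep 0: shuffle raw records to owning peers");
        }
      }

      static class FirstUserSuperstep implements Superstep {
        public void compute() {
          System.out.println("superstep 1: user compute() starts");
        }
      }

      public static void main(String[] args) {
        List<Superstep> plan = new ArrayList<Superstep>();
        plan.add(new PartitioningSuperstep()); // injected before the user's steps
        plan.add(new FirstUserSuperstep());
        for (Superstep s : plan) {
          s.compute(); // barrier synchronization between steps omitted
        }
      }
    }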
>>>> > On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>> >> P.S., BSPJob (with table input) is also the same; it's not only for
>>>> >> GraphJob.
>>>> >>
>>>> >> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>> >> > All,
>>>> >> >
>>>> >> > I've also roughly described the details of the Graph API design[1].
>>>> >> > To reduce our misunderstandings (please first read the Partitioning
>>>> >> > and GraphModuleInternals documents):
>>>> >> >
>>>> >> > * In the NoSQL case, there's obviously no need to hash-partition or
>>>> >> > rewrite partition files on HDFS. So, for these inputs, I think the
>>>> >> > vertex structure should be parsed in the
>>>> >> > GraphJobRunner.loadVertices() method.
>>>> >> >
>>>> >> > Here we face two options: 1) The current implementation of
>>>> >> > 'PartitioningRunner' writes converted vertices to sequence-format
>>>> >> > partition files, and GraphJobRunner reads only VertexWritable
>>>> >> > objects. If the input is a table, we may have to skip the
>>>> >> > partitioning job and parse the vertex structure in loadVertices()
>>>> >> > after checking some conditions. 2) PartitioningRunner just writes
>>>> >> > raw records to the proper partition files after checking their
>>>> >> > partition IDs, and GraphJobRunner.loadVertices() always parses and
>>>> >> > loads the vertices.
>>>> >> >
>>>> >> > I meant that I prefer the latter, and that there's no need to write
>>>> >> > VertexWritable files. It's not related to whether graph will
>>>> >> > support only the Seq format or not. Hope my explanation is enough!
>>>> >> >
>>>> >> > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>> >> >
>>>> >> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>>> >> >> I've described my big picture here:
>>>> >> >> http://wiki.apache.org/hama/Partitioning
>>>> >> >>
>>>> >> >> Please review it and give feedback on whether this is acceptable.
>>>> >> >>
>>>> >> >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>> >> >>> P.S., I think there's a misunderstanding: it doesn't mean that
>>>> >> >>> graph will support only the sequence file format. The main
>>>> >> >>> question is whether to convert at the partitioning stage or at
>>>> >> >>> the loadVertices stage.
>>>> >> >>>
>>>> >> >>> Sent from my iPhone
>>>> >> >>>
>>>> >> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>> >> >>>
>>>> >> >>>> Sure, please go ahead.
>>>> >> >>>>
>>>> >> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>>> >> >>>>
>>>> >> >>>>>>> Please let me know before this is changed; I would like to
>>>> >> >>>>>>> work on a separate branch.
>>>> >> >>>>>
>>>> >> >>>>> Personally, I think we have to focus on high-priority tasks,
>>>> >> >>>>> and on more feedback and contributions from users. So if
>>>> >> >>>>> changes are made, I'll release periodically. If you want to
>>>> >> >>>>> work in another place, please do; I don't want to wait for
>>>> >> >>>>> your patches.
>>>> >> >>>>>
>>>> >> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>>> >> >>>>>> For preparing the integration with NoSQLs, of course, maybe a
>>>> >> >>>>>> condition check (whether converted or not) can be used without
>>>> >> >>>>>> removing the record converter.
>>>> >> >>>>>>
>>>> >> >>>>>> We need to discuss everything.
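A sketch of such a condition check, with a made-up configuration key (not an
actual Hama property): skip the partitioning step when the input is already
range-partitioned, and let loadVertices() parse the rows directly.

    import java.util.HashMap;
    import java.util.Map;

    public class SkipPartitioningSketch {

      // A table input that supports range/random access by sorted key is
      // already range-partitioned, so re-partitioning it on HDFS is wasted I/O.
      static boolean needsPartitioning(Map<String, String> conf) {
        return !"true".equals(conf.get("bsp.input.prepartitioned")); // made-up key
      }

      public static void main(String[] args) {
        Map<String, String> conf = new HashMap<String, String>();
        conf.put("bsp.input.prepartitioned", "true"); // e.g. a sorted NoSQL table
        if (needsPartitioning(conf)) {
          System.out.println("run the partitioning step, then loadVertices()");
        } else {
          System.out.println("skip partitioning; loadVertices() parses rows directly");
        }
      }
    }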
>>>> >> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>>> >> >>>>>>> I am still -1 if this means our graph module can work only
>>>> >> >>>>>>> on the sequence file format.
>>>> >> >>>>>>> Please note that you can set the record converter to null
>>>> >> >>>>>>> and make changes to loadVertices for what you want here.
>>>> >> >>>>>>>
>>>> >> >>>>>>> If we came to this design because TextInputFormat is
>>>> >> >>>>>>> inefficient, would this work for the Avro or Thrift input
>>>> >> >>>>>>> formats?
>>>> >> >>>>>>> Please let me know before this is changed; I would like to
>>>> >> >>>>>>> work on a separate branch.
>>>> >> >>>>>>> You may proceed as you wish.
>>>> >> >>>>>>>
>>>> >> >>>>>>> Regards,
>>>> >> >>>>>>> Suraj
>>>> >> >>>>>>>
>>>> >> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>> >> >>>>>>>> I think the 'record converter' should be removed. It's not
>>>> >> >>>>>>>> a good idea; moreover, it's unnecessarily complex. To keep
>>>> >> >>>>>>>> the vertex input reader, we can move the related classes
>>>> >> >>>>>>>> into the common module.
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> Let's go with my original plan.
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>>> >> >>>>>>>>> Hi all,
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> I'm reading our old discussions about the record
>>>> >> >>>>>>>>> converter, superstep injection, and the common module:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>> >> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> To clarify the goals and objectives:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> 1. Parallel input partitioning is necessary for the
>>>> >> >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
>>>> >> >>>>>>>>> processing (it's not a memory issue, nor the Disk/Spilling
>>>> >> >>>>>>>>> Queue, nor HAMA-644; please don't conflate them).
>>>> >> >>>>>>>>> 2. Input partitioning should be handled at the BSP
>>>> >> >>>>>>>>> framework level, and it is for every Hama job, not only
>>>> >> >>>>>>>>> for graph jobs.
>>>> >> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>>>> >> >>>>>>>>> input should also be considered.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> The current problem is that every input of a graph job has
>>>> >> >>>>>>>>> to be rewritten on HDFS. If you have a good idea, please
>>>> >> >>>>>>>>> let me know.

--
Best Regards, Edward J. Yoon
@eddieyoon
