I think there was a misunderstanding around the terms 'remove' and 'record converter'.
PartitioningRunner converts records; that is what I call the 'record converter'. But there's no need to write the converted records in PartitioningRunner. The Partitioner is just a partitioner in the BSP core module.

On Tue, May 7, 2013 at 3:46 AM, Edward J. Yoon <[email protected]> wrote:
> Currently, the PartitioningRunner writes converted records to a
> partition file, and then GraphJobRunner reads VertexWritable, NullWritable
> K/V records. In other words:
>
> 1) input record: 'a\tb\tc' // assume that input is Text
> 2) partition files: a sequence of Vertex writables
> 3) GraphJobRunner.loadVertices() reads the sequence-format partition files.
>
> My suggestion is to just write raw records to the partition file in
> PartitioningRunner:
>
> 1) input record: 'a\tb\tc' // assume that input is Text
> 2) partition files: 'a\tb\tc' // data shuffled by partition ID, but the
> format is the same as the original.
> 3) GraphJobRunner.loadVertices() reads the records from its assigned
> partition and parses the Vertex structure.
>
> Only a few lines will change.
>
> Why? As I described in the Wiki, in the NoSQL table-input case (which
> supports range or random access by sorted key), there's no need to
> re-partition, because the data is already range-partitioned. That means
> parsing of the vertex structure is needed in GraphJobRunner.
>
> With or without Suraj's suggestion, parsing the vertex structure should
> be done in the GraphJobRunner.loadVertices() method to prepare for the
> NoSQL input formats.
>
> Does that make sense?
>
> On Tue, May 7, 2013 at 2:55 AM, Tommaso Teofili
> <[email protected]> wrote:
>> 2013/5/6 Edward J. Yoon <[email protected]>
>>
>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>> before the first superstep of the job. (This has a dependency on the
>>>> Superstep API.)
>>>> - The partitions, instead of being written to HDFS, which creates a
>>>> copy of the input files in the HDFS cluster (too costly, I believe),
>>>> should be written to local files and read from there.
>>>> - For graph jobs, we can configure this partitioning superstep class
>>>> specific to a graph partitioning class that partitions and loads
>>>> vertices.
>>>
>>> I believe the above suggestion can be a future improvement task.
>>>
>>>> This sure has some dependencies, but it would be a graceful solution
>>>> and can tackle every problem. This is what I want to achieve in the
>>>> end. Please proceed if you have any intermediate ways to get there
>>>> faster.
>>>
>>> If you understand my plan now, please let me know so that I can start
>>> the work. My patch will change only a few lines.
>>
>> While it's clear to me what Suraj's proposal is, I'm not completely sure
>> what your final proposal would be. Could you explain it in more detail
>> (or perhaps a patch to review would be enough)?
>>
>>> Finally, I think we can now prepare the integration with the NoSQL
>>> table input formats.
>>
>> As I said, I'd like to have broad consensus before making any
>> significant change to core stuff.
>>
>> thanks,
>> Tommaso
>>
>> p.s.:
>> probably worth a different thread: what's the NoSQL usage scenario with
>> regard to Hama?
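For concreteness, a minimal sketch of the proposed step 3, assuming a tab-separated Text record whose first token is the vertex ID. The class and method names below are invented for illustration; this is not Hama's actual API:

import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch only. The partition file keeps the raw record
 * 'a\tb\tc' unchanged; loadVertices() does the parsing itself instead of
 * reading pre-converted VertexWritable objects from a sequence file.
 */
public class RawVertexParsing {

    /** Minimal stand-in for a parsed vertex: an ID plus outgoing edges. */
    static final class Vertex {
        final String id;
        final List<String> edges;

        Vertex(String id, List<String> edges) {
            this.id = id;
            this.edges = edges;
        }
    }

    /** Parses one raw tab-separated record: "a\tb\tc" -> id "a", edges [b, c]. */
    static Vertex parse(String rawRecord) {
        String[] tokens = rawRecord.split("\t");
        return new Vertex(tokens[0], Arrays.asList(tokens).subList(1, tokens.length));
    }

    public static void main(String[] args) {
        // Conversion happens at load time, not at partitioning time.
        Vertex v = parse("a\tb\tc");
        System.out.println(v.id + " -> " + v.edges); // prints: a -> [b, c]
    }
}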
>>> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <[email protected]> wrote:
>>>> I am assuming that the storage of vertices (NoSQL or any other format)
>>>> need not be updated after every iteration.
>>>>
>>>> Based on the above assumption, I have the following suggestions:
>>>>
>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>> before the first superstep of the job. (This has a dependency on the
>>>> Superstep API.)
>>>> - The partitions, instead of being written to HDFS, which creates a
>>>> copy of the input files in the HDFS cluster (too costly, I believe),
>>>> should be written to local files and read from there.
>>>> - For graph jobs, we can configure this partitioning superstep class
>>>> specific to a graph partitioning class that partitions and loads
>>>> vertices.
>>>>
>>>> This sure has some dependencies, but it would be a graceful solution
>>>> and can tackle every problem. This is what I want to achieve in the
>>>> end. Please proceed if you have any intermediate ways to get there
>>>> faster.
>>>>
>>>> Regards,
>>>> Suraj
>>>>
>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>>> P.S.: BSPJob (with table input) is the same; it's not only for GraphJob.
>>>>>
>>>>> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>> All,
>>>>>>
>>>>>> I've also roughly described the details of the Graph API design [1].
>>>>>> To reduce our misunderstandings (please read the Partitioning and
>>>>>> GraphModuleInternals documents first):
>>>>>>
>>>>>> * In the NoSQL case, there's obviously no need for hash partitioning
>>>>>> or for rewriting partition files on HDFS. So, for these inputs, I
>>>>>> think the vertex structure should be parsed in the
>>>>>> GraphJobRunner.loadVertices() method.
>>>>>>
>>>>>> Here we face two options: 1) The current implementation of
>>>>>> 'PartitioningRunner' writes converted vertices to sequence-format
>>>>>> partition files, and GraphJobRunner reads only Vertex Writable
>>>>>> objects. If the input is a table, we may have to skip the
>>>>>> partitioning job and parse the vertex structure in the loadVertices()
>>>>>> method after checking some conditions. 2) PartitioningRunner just
>>>>>> writes raw records to the proper partition files after checking their
>>>>>> partition IDs, and GraphJobRunner.loadVertices() always parses and
>>>>>> loads the vertices.
>>>>>>
>>>>>> I meant that I prefer the latter, and that there's no need to write
>>>>>> VertexWritable files. It's not related to whether graph will support
>>>>>> only the Seq format or not. I hope my explanation is enough!
>>>>>>
>>>>>> 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>>>>
>>>>>> On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> I've described my big picture here:
>>>>>>> http://wiki.apache.org/hama/Partitioning
>>>>>>>
>>>>>>> Please review it and give feedback on whether it is acceptable.
>>>>>>>
>>>>>>> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>>>>>> p.s., I think there's a misunderstanding. It doesn't mean that
>>>>>>>> graph will support only the sequence file format. The main question
>>>>>>>> is whether to convert at the partitioning stage or at the
>>>>>>>> loadVertices stage.
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>>>>>>> Sure, please go ahead.
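For concreteness, a sketch of option 2 above, the raw-record path: the partitioner only computes a partition ID and appends the unchanged record to that partition's file. The names and the hash scheme are assumptions for illustration; this is not the actual PartitioningRunner code:

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch only. Records are shuffled into partition files by
 * partition ID, but each record is written in its original, unconverted
 * format; no VertexWritable is produced at this stage.
 */
public class RawRecordPartitioning {

    /** Hash partitioning on the vertex ID (the first tab-separated token). */
    static int partitionIdFor(String vertexId, int numPartitions) {
        return (vertexId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Appends the raw record, unchanged, to the writer for its partition. */
    static void writeRecord(String rawRecord, List<Writer> partitionWriters)
            throws IOException {
        String vertexId = rawRecord.split("\t", 2)[0];
        int pid = partitionIdFor(vertexId, partitionWriters.size());
        // Parsing is deferred to GraphJobRunner.loadVertices().
        partitionWriters.get(pid).write(rawRecord + "\n");
    }

    public static void main(String[] args) throws IOException {
        List<Writer> writers = Arrays.<Writer>asList(new StringWriter(), new StringWriter());
        writeRecord("a\tb\tc", writers);
        System.out.println("partition 0: " + writers.get(0));
        System.out.println("partition 1: " + writers.get(1));
    }
}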
>>>>>>>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>> Please let me know before this is changed; I would like to work
>>>>>>>>>>> on a separate branch.
>>>>>>>>>>
>>>>>>>>>> Personally, I think we have to focus on the high-priority tasks,
>>>>>>>>>> and on getting more feedback and contributions from users. So, as
>>>>>>>>>> changes are made, I'll release periodically. If you want to work
>>>>>>>>>> somewhere else, please do. I don't want to wait for your patches.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>> To prepare the integration with NoSQLs, of course, a condition
>>>>>>>>>>> check (whether converted or not) could perhaps be used without
>>>>>>>>>>> removing the record converter.
>>>>>>>>>>>
>>>>>>>>>>> We need to discuss everything.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>>>>>>>>>>> I am still -1 if this means our graph module can work only on
>>>>>>>>>>>> the sequential file format.
>>>>>>>>>>>> Please note that you can set the record converter to null and
>>>>>>>>>>>> make changes to loadVertices for what you desire here.
>>>>>>>>>>>>
>>>>>>>>>>>> If we came to this design because TextInputFormat is
>>>>>>>>>>>> inefficient, would this work for the Avro or Thrift input
>>>>>>>>>>>> formats?
>>>>>>>>>>>> Please let me know before this is changed; I would like to work
>>>>>>>>>>>> on a separate branch.
>>>>>>>>>>>> You may proceed as you wish.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Suraj
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>>>> I think the 'record converter' should be removed. It's not a
>>>>>>>>>>>>> good idea; moreover, it's unnecessarily complex. To keep the
>>>>>>>>>>>>> vertex input reader, we can move the related classes into the
>>>>>>>>>>>>> common module.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's go with my original plan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm reading our old discussions about the record converter,
>>>>>>>>>>>>>> superstep injection, and the common module:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>>>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To clarify the goals and objectives:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Parallel input partitioning is necessary for the
>>>>>>>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
>>>>>>>>>>>>>> processing. (It's not a memory issue, or the Disk/Spilling
>>>>>>>>>>>>>> Queue, or HAMA-644; please don't conflate them.)
>>>>>>>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>>>>>>>>>>>>>> level, and it applies to every Hama job, not only to graph
>>>>>>>>>>>>>> jobs.
>>>>>>>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>>>>>>>>>>>>>> input should also be considered.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current problem is that every input of a graph job has to
>>>>>>>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me
>>>>>>>>>>>>>> know.
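And a sketch of the condition check mentioned above, i.e. deciding whether an input still needs a partitioning job at all: inputs that are already range-partitioned by sorted key, as in the NoSQL table case, would skip it. The enum and method are hypothetical, for illustration only:

/**
 * Illustrative sketch only. Inputs that are already range-partitioned
 * (e.g. a NoSQL table sorted by key) skip the partitioning job entirely;
 * with raw-record partition files, loadVertices() always does the parsing.
 */
public class PartitioningDecision {

    enum InputKind { TEXT_FILE, SEQUENCE_FILE, RANGE_PARTITIONED_TABLE }

    /** Already range-partitioned tables need no re-partitioning job. */
    static boolean needsPartitioningJob(InputKind kind) {
        return kind != InputKind.RANGE_PARTITIONED_TABLE;
    }

    public static void main(String[] args) {
        for (InputKind kind : InputKind.values()) {
            System.out.println(kind + " -> needs partitioning job: "
                    + needsPartitioningJob(kind));
        }
    }
}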
--
Best Regards, Edward J. Yoon
@eddieyoon
