2013/5/6 Edward J. Yoon <[email protected]>

> > - Instead of running a separate job, we inject a partitioning superstep
> > before the first superstep of the job. (This has a dependency on the
> > Superstep API)
> > - The partitions, instead of being written to HDFS, which creates a copy
> > of the input files in the HDFS cluster (too costly, I believe), should be
> > written to local files and read from there.
> > - For graph jobs, we can configure this partitioning superstep class
> > specific to a graph partitioning class that partitions and loads vertices.
>
> I believe that the above suggestion can be a future improvement task.
>
> > This sure has some dependencies, but it would be a graceful solution and
> > can tackle every problem. This is what I want to achieve in the end.
> > Please proceed if you have any intermediate ways to reach here faster.
>
> If you understand my plan now, please let me know so that I can start
> the work. My patch will change only a few lines.
>
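
For concreteness, here is a minimal, self-contained sketch of the kind of
injected partitioning step the quoted proposal describes: hash each raw input
record to a partition ID and spill it to a local file instead of rewriting the
input on HDFS. RawRecordSource, LocalPartitioningStep and everything else below
are made-up illustrative names, not Hama's actual Superstep API.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

/** Illustrative only; not Hama's Superstep API. */
interface RawRecordSource {
  /** Returns the next "key<TAB>value" record of raw input, or null when exhausted. */
  String next() throws IOException;
}

public class LocalPartitioningStep {

  private final int numPartitions;

  public LocalPartitioningStep(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  /** Hash-partitions a record key into [0, numPartitions). */
  int partitionIdOf(String key) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  /** Reads raw records and appends each one to a per-partition local spill file. */
  public void run(RawRecordSource input, File localDir) throws IOException {
    BufferedWriter[] spills = new BufferedWriter[numPartitions];
    try {
      for (int i = 0; i < numPartitions; i++) {
        spills[i] = new BufferedWriter(new FileWriter(new File(localDir, "part-" + i)));
      }
      String record;
      while ((record = input.next()) != null) {
        String key = record.split("\t", 2)[0];
        BufferedWriter out = spills[partitionIdOf(key)];
        out.write(record);
        out.newLine();
      }
    } finally {
      for (BufferedWriter w : spills) {
        if (w != null) w.close();
      }
    }
  }
}
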
while to me it's clear what Suraj's proposal is, I'm not completely sure
about what your final proposal would be, could you explain that in more
detail (or otherwise perhaps a patch to review is enough)?

> Finally, I think now we can prepare the integration with NoSQLs table
> input format.

as I said, I'd like to have a broad consensus before doing any significant
change to core stuff.

thanks,
Tommaso

p.s.: probably worth a different thread: what's the NoSQL usage scenario
with regard to Hama?

> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <[email protected]> wrote:
> > I am assuming that the storage of vertices (NoSQL or any other format)
> > need not be updated after every iteration.
> >
> > Based on the above assumption, I have the following suggestions:
> >
> > - Instead of running a separate job, we inject a partitioning superstep
> > before the first superstep of the job. (This has a dependency on the
> > Superstep API)
> > - The partitions, instead of being written to HDFS, which creates a copy
> > of the input files in the HDFS cluster (too costly, I believe), should be
> > written to local files and read from there.
> > - For graph jobs, we can configure this partitioning superstep class
> > specific to a graph partitioning class that partitions and loads vertices.
> >
> > This sure has some dependencies, but it would be a graceful solution and
> > can tackle every problem. This is what I want to achieve in the end.
> > Please proceed if you have any intermediate ways to reach here faster.
> >
> > Regards,
> > Suraj
> >
> > On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
> >> P.S., BSPJob (with table input) is also the same. It's not only for
> >> GraphJob.
> >>
> >> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
> >> > All,
> >> >
> >> > I've also roughly described the design of the Graph APIs [1]. To
> >> > reduce our misunderstandings, please read the Partitioning and
> >> > GraphModuleInternals documents first.
> >> >
> >> > * In the NoSQL case, there's obviously no need to hash-partition or
> >> > rewrite partition files on HDFS. So, for these inputs, I think the
> >> > vertex structure should be parsed in the GraphJobRunner.loadVertices()
> >> > method.
> >> >
> >> > Here we face two options: 1) The current implementation of
> >> > 'PartitioningRunner' writes converted vertices to sequence-format
> >> > partition files, and GraphJobRunner reads only VertexWritable
> >> > objects. If the input is a table, we may have to skip the partitioning
> >> > job and parse the vertex structure in the loadVertices() method after
> >> > checking some conditions. 2) PartitioningRunner just writes raw
> >> > records to the proper partition files after checking their partition
> >> > IDs, and GraphJobRunner.loadVertices() always parses and loads the
> >> > vertices.
> >> >
> >> > I meant that I prefer the latter, and that there's no need to write
> >> > VertexWritable files. It's not related to whether graph will support
> >> > only the Seq format or not. Hope my explanation is enough!
> >> >
> >> > 1. http://wiki.apache.org/hama/GraphModuleInternals
> >> >
> >> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
> >> >> I've described my big picture here: http://wiki.apache.org/hama/Partitioning
> >> >>
> >> >> Please review and give feedback on whether this is acceptable.
> >> >>
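
To make the second option above concrete (the partition pass keeps records raw,
and loadVertices() always does the parsing), here is a small sketch.
SketchVertex, VertexParser and loadVertices are simplified stand-ins, not the
real GraphJobRunner or VertexInputReader classes; the point is only that no
intermediate VertexWritable sequence file is ever written.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a vertex; not a Hama class. */
class SketchVertex {
  final String id;
  final List<String> outEdges;

  SketchVertex(String id, List<String> outEdges) {
    this.id = id;
    this.outEdges = outEdges;
  }
}

/** Parses one raw record, e.g. "id<TAB>neighbor1,neighbor2", into a vertex. */
interface VertexParser {
  SketchVertex parse(String rawRecord);
}

public class LoadVerticesSketch {

  /**
   * The partition file contains raw records exactly as they appeared in the
   * input, so loading always goes through the parser; no converted vertex
   * file is needed between partitioning and loading.
   */
  public static List<SketchVertex> loadVertices(String partitionFile, VertexParser parser)
      throws IOException {
    List<SketchVertex> vertices = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new FileReader(partitionFile))) {
      String raw;
      while ((raw = in.readLine()) != null) {
        vertices.add(parser.parse(raw));
      }
    }
    return vertices;
  }
}
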
> >> >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
> >> >>> p.s., I think there's a misunderstanding. It doesn't mean that graph
> >> >>> will support only the sequence file format. The main question is
> >> >>> whether to convert at the partitioning stage or at the loadVertices
> >> >>> stage.
> >> >>>
> >> >>> Sent from my iPhone
> >> >>>
> >> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
> >> >>>
> >> >>>> Sure, please go ahead.
> >> >>>>
> >> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
> >> >>>>
> >> >>>>>>> Please let me know before this is changed, I would like to work
> >> >>>>>>> on a separate branch.
> >> >>>>>
> >> >>>>> Personally, I think we have to focus on high-priority tasks, and on
> >> >>>>> more feedback and contributions from users. So, if changes are made,
> >> >>>>> I'll release periodically. If you want to work in another place,
> >> >>>>> please do. I don't want to wait for your patches.
> >> >>>>>
> >> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
> >> >>>>>> For preparing the integration with NoSQLs, of course, maybe a
> >> >>>>>> condition check (whether converted or not) can be used without
> >> >>>>>> removing the record converter.
> >> >>>>>>
> >> >>>>>> We need to discuss everything.
> >> >>>>>>
> >> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
> >> >>>>>>> I am still -1 if this means our graph module can work only on the
> >> >>>>>>> sequential file format.
> >> >>>>>>> Please note that you can set the record converter to null and make
> >> >>>>>>> changes to loadVertices for what you desire here.
> >> >>>>>>>
> >> >>>>>>> If we came to this design because TextInputFormat is inefficient,
> >> >>>>>>> would this work for the Avro or Thrift input formats?
> >> >>>>>>> Please let me know before this is changed, I would like to work
> >> >>>>>>> on a separate branch.
> >> >>>>>>> You may proceed as you wish.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Suraj
> >> >>>>>>>
> >> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
> >> >>>>>>>
> >> >>>>>>>> I think the 'record converter' should be removed. It's not a good
> >> >>>>>>>> idea; moreover, it's unnecessarily complex. To keep the vertex
> >> >>>>>>>> input reader, we can move the related classes into the common
> >> >>>>>>>> module.
> >> >>>>>>>>
> >> >>>>>>>> Let's go with my original plan.
> >> >>>>>>>>
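
Suraj's "set the record converter to null" and Edward's "condition check
(whether converted or not)" point at the same escape hatch. A hedged sketch of
that check follows, with invented interface names rather than Hama's real
record converter API.

/** Illustrative interfaces only; not Hama's real record converter API. */
interface RecordConverter<K, V, T> {
  T convert(K key, V value);
}

interface RawParser<K, V, T> {
  T parse(K key, V value);
}

public class ConversionStep<K, V, T> {

  private final RecordConverter<K, V, T> converter; // may be null, e.g. for table/NoSQL input
  private final RawParser<K, V, T> parser;

  public ConversionStep(RecordConverter<K, V, T> converter, RawParser<K, V, T> parser) {
    this.converter = converter;
    this.parser = parser;
  }

  /** If a converter is configured, use it; otherwise fall back to parsing the raw record. */
  public T toVertex(K key, V value) {
    return (converter != null) ? converter.convert(key, value) : parser.parse(key, value);
  }
}
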
> >> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
> >> >>>>>>>>> Hi all,
> >> >>>>>>>>>
> >> >>>>>>>>> I'm reading our old discussions about the record converter,
> >> >>>>>>>>> superstep injection, and the common module:
> >> >>>>>>>>>
> >> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
> >> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
> >> >>>>>>>>>
> >> >>>>>>>>> To clarify goals and objectives:
> >> >>>>>>>>>
> >> >>>>>>>>> 1. Parallel input partitioning is necessary for the scalability
> >> >>>>>>>>> and elasticity of Bulk Synchronous Parallel processing (it's not
> >> >>>>>>>>> a memory issue, nor a Disk/Spilling Queue issue, nor HAMA-644;
> >> >>>>>>>>> please don't conflate these).
> >> >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
> >> >>>>>>>>> level, and it is for every Hama job, not only for graph jobs.
> >> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL input
> >> >>>>>>>>> should also be considered.
> >> >>>>>>>>>
> >> >>>>>>>>> The current problem is that every input of a graph job has to be
> >> >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Best Regards, Edward J. Yoon
> >> >>>>>>>>> @eddieyoon
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>
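
If the discussion settles on skipping the HDFS rewrite for inputs that are
already partitioned or served by a table/NoSQL store, the user-facing side
could be as small as a per-job switch. HamaConfiguration and GraphJob below
are real Hama classes, but the property name and the switch itself are
invented for this sketch and do not exist today.

import org.apache.hama.HamaConfiguration;
import org.apache.hama.graph.GraphJob;

public class TableInputJobSketch {

  public static void main(String[] args) throws Exception {
    HamaConfiguration conf = new HamaConfiguration();

    // Invented property, for illustration only: declare that the input is
    // already partitioned (or lives in a table/NoSQL store) and therefore
    // must not be rewritten on HDFS before the first superstep.
    conf.setBoolean("bsp.input.partitioning.skip", true);

    GraphJob job = new GraphJob(conf, TableInputJobSketch.class);
    job.setJobName("graph job over table input (sketch)");
    // vertex class, input/output formats, etc. would be configured here as usual
  }
}
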
