> - Instead of running a separate job, we inject a partitioning superstep
>   before the first superstep of the job. (This has a dependency on the
>   Superstep API.)
> - The partitions, instead of being written to HDFS (which creates a copy
>   of the input files in the HDFS cluster, too costly I believe), should
>   be written to local files and read from there.
> - For graph jobs, we can configure this partitioning superstep with a
>   graph-specific partitioning class that partitions and loads vertices.
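[Editor's note: a minimal sketch of the quoted proposal's second bullet, writing partitions to local files keyed by a partition ID instead of copying input back into HDFS. All class, method, and file names here are illustrative stand-ins, not Hama's actual API.]

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative only: hash-partition raw records into local per-partition
// files, so the input never has to be rewritten into the HDFS cluster.
public class LocalPartitionWriter {

  // Stable, non-negative partition id for a record key.
  public static int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Writes each raw record (keyed by its first whitespace-separated token)
  // to a local partition file such as part-0, part-1, ...
  public static void partition(List<String> rawRecords, int numPartitions,
                               Path localDir) throws IOException {
    BufferedWriter[] writers = new BufferedWriter[numPartitions];
    try {
      for (int i = 0; i < numPartitions; i++) {
        writers[i] = Files.newBufferedWriter(localDir.resolve("part-" + i));
      }
      for (String record : rawRecords) {
        String key = record.split("\\s+", 2)[0];
        BufferedWriter w = writers[partitionFor(key, numPartitions)];
        w.write(record);
        w.newLine();
      }
    } finally {
      for (BufferedWriter w : writers) {
        if (w != null) w.close();
      }
    }
  }
}
```

A graph-specific partitioning class, as in the third bullet, would simply swap in a different key-extraction and partition function.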
I believe the above suggestion can be a future improvement task.

> This sure has some dependencies, but it would be a graceful solution and
> can tackle every problem. This is what I want to achieve in the end.
> Please proceed if you have any intermediate ways to reach here faster.

If you understand my plan now, please let me know so that I can start the
work. My patch will change only a few lines. Finally, I think we can now
prepare the integration with the NoSQL table input formats.

On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <[email protected]> wrote:
> I am assuming that the storage of vertices (NoSQL or any other format)
> need not be updated after every iteration.
>
> Based on the above assumption, I have the following suggestions:
>
> - Instead of running a separate job, we inject a partitioning superstep
>   before the first superstep of the job. (This has a dependency on the
>   Superstep API.)
> - The partitions, instead of being written to HDFS (which creates a copy
>   of the input files in the HDFS cluster, too costly I believe), should
>   be written to local files and read from there.
> - For graph jobs, we can configure this partitioning superstep with a
>   graph-specific partitioning class that partitions and loads vertices.
>
> This sure has some dependencies, but it would be a graceful solution and
> can tackle every problem. This is what I want to achieve in the end.
> Please proceed if you have any intermediate ways to reach here faster.
>
> Regards,
> Suraj
>
> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>
>> P.S., BSPJob (with table input) is also the same. It's not only for
>> GraphJob.
>>
>> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
>> wrote:
>> > All,
>> >
>> > I've also roughly described the details of the Graph API design [1].
>> > To reduce our misunderstandings (please read the Partitioning and
>> > GraphModuleInternals documents first):
>> >
>> > * In the NoSQL case, there is obviously no need to hash-partition or
>> > rewrite partition files on HDFS. So for these inputs, I think the
>> > vertex structure should be parsed in the
>> > GraphJobRunner.loadVertices() method.
>> >
>> > Here we face two options: 1) The current implementation of
>> > PartitioningRunner writes converted vertices to sequence-format
>> > partition files, and GraphJobRunner reads only VertexWritable
>> > objects. If the input is a table, we may have to skip the
>> > partitioning job and parse the vertex structure in the loadVertices()
>> > method after checking some conditions. 2) PartitioningRunner just
>> > writes raw records to the proper partition files after checking their
>> > partition IDs, and GraphJobRunner.loadVertices() always parses and
>> > loads the vertices.
>> >
>> > I meant that I prefer the latter, and that there is no need to write
>> > VertexWritable files. It is not related to whether graph will support
>> > only the sequence format or not. I hope my explanation is enough!
>> >
>> > 1. http://wiki.apache.org/hama/GraphModuleInternals
>> >
>> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]>
>> > wrote:
>> >> I've described my big picture here:
>> >> http://wiki.apache.org/hama/Partitioning
>> >>
>> >> Please review it and give feedback on whether this is acceptable.
>> >>
>> >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>> >>> P.S., I think there's a misunderstanding: it doesn't mean that
>> >>> graph will support only the sequence file format. The main question
>> >>> is whether to convert at the partitioning stage or at the
>> >>> loadVertices stage.
>> >>>
>> >>> Sent from my iPhone
>> >>>
>> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>> >>>
>> >>>> Sure, please go ahead.
>> >>>>
>> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon
>> >>>> <[email protected]> wrote:
>> >>>>
>> >>>>>>> Please let me know before this is changed; I would like to work
>> >>>>>>> on a separate branch.
>> >>>>>
>> >>>>> Personally, I think we have to focus on high-priority tasks, and
>> >>>>> on getting more feedback and contributions from users. So, if
>> >>>>> changes are made, I'll release periodically. If you want to work
>> >>>>> somewhere else, please do. I don't want to wait for your patches.
>> >>>>>
>> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon
>> >>>>> <[email protected]> wrote:
>> >>>>>> To prepare the integration with NoSQLs, of course, a condition
>> >>>>>> check (whether converted or not) could be used without removing
>> >>>>>> the record converter.
>> >>>>>>
>> >>>>>> We need to discuss everything.
>> >>>>>>
>> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon
>> >>>>>> <[email protected]> wrote:
>> >>>>>>> I am still -1 if this means our graph module can work only on
>> >>>>>>> the sequence file format.
>> >>>>>>> Please note that you can set the record converter to null and
>> >>>>>>> make changes to loadVertices for what you desire here.
>> >>>>>>>
>> >>>>>>> If we came to this design because TextInputFormat is
>> >>>>>>> inefficient, would it work for the Avro or Thrift input
>> >>>>>>> formats?
>> >>>>>>> Please let me know before this is changed; I would like to work
>> >>>>>>> on a separate branch.
>> >>>>>>> You may proceed as you wish.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Suraj
>> >>>>>>>
>> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon
>> >>>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> I think the record converter should be removed. It's not a
>> >>>>>>>> good idea; moreover, it's unnecessarily complex. To keep the
>> >>>>>>>> vertex input reader, we can move the related classes into the
>> >>>>>>>> common module.
>> >>>>>>>>
>> >>>>>>>> Let's go with my original plan.
>> >>>>>>>>
>> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon
>> >>>>>>>> <[email protected]> wrote:
>> >>>>>>>>> Hi all,
>> >>>>>>>>>
>> >>>>>>>>> I'm reading our old discussions about the record converter,
>> >>>>>>>>> superstep injection, and the common module:
>> >>>>>>>>>
>> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>> >>>>>>>>>
>> >>>>>>>>> To clarify the goals and objectives:
>> >>>>>>>>>
>> >>>>>>>>> 1. Parallel input partitioning is necessary for the
>> >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
>> >>>>>>>>> processing. (It's not a memory issue, a disk/spilling-queue
>> >>>>>>>>> issue, or HAMA-644; please don't conflate these.)
>> >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>> >>>>>>>>> level, and it applies to every Hama job, not only graph jobs.
>> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>> >>>>>>>>> inputs should also be considered.
>> >>>>>>>>>
>> >>>>>>>>> The current problem is that every input of a graph job has to
>> >>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me
>> >>>>>>>>> know.
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best Regards, Edward J. Yoon
>> >>>>>>>>> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
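[Editor's note: a minimal sketch of the second option discussed in the thread, where PartitioningRunner only moves raw records and loadVertices() always parses them, so no intermediate VertexWritable files are ever written. The record layout (tab-separated vertex id plus a neighbor list) and all names are assumptions for illustration, not Hama's actual classes.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a loadVertices()-style pass that parses raw
// partition records directly into in-memory vertex objects.
public class RawVertexLoader {

  // Minimal stand-in for a vertex: an id plus its outgoing edges.
  public static final class Vertex {
    public final String id;
    public final List<String> edges;
    Vertex(String id, List<String> edges) {
      this.id = id;
      this.edges = edges;
    }
  }

  // Parses records of the form "vertexId<TAB>neighbor1 neighbor2 ...".
  public static Map<String, Vertex> loadVertices(List<String> rawRecords) {
    Map<String, Vertex> vertices = new LinkedHashMap<>();
    for (String record : rawRecords) {
      String[] parts = record.split("\t", 2);
      List<String> edges = new ArrayList<>();
      if (parts.length > 1 && !parts[1].isEmpty()) {
        for (String neighbor : parts[1].split("\\s+")) {
          edges.add(neighbor);
        }
      }
      vertices.put(parts[0], new Vertex(parts[0], edges));
    }
    return vertices;
  }
}
```

Under this option, swapping the input source (sequence file, text, or a NoSQL table scan) only changes where the raw records come from; the parsing stays in one place.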
