2013/5/6 Suraj Menon <[email protected]>

> I am assuming that the storage of vertices (NoSQL or any other format) need
> not be updated after every iteration.
>
> Based on the above assumption, I have the following suggestions:
>
> - Instead of running a separate job, we inject a partitioning superstep
> before the first superstep of the job. (This has a dependency on the
> Superstep API)
>

could we do that without introducing that dependency? I mean would that
work also if not using the Superstep API on the client side?

> - The partitions instead of being written to HDFS, which is creating a copy
> of input files in HDFS Cluster (too costly I believe), should be written to
> local files and read from.
>

+1

> - For graph jobs, we can configure this partitioning superstep class
> specific to graph partitioning class that partitions and loads vertices.
>

this seems to be in line with the above assumption thus it probably makes
sense.

>
> This sure has some dependencies. But would be a graceful solution and can
> tackle every problem. This is what I want to achieve in the end. Please
> proceed if you have any intermediate ways to reach here faster.
>

Your solution sounds good to me generally, better if we can avoid the
dependency, but still ok if not.
Let's collect also others' opinions and try to reach a shared consensus.
Tommaso

>
> Regards,
> Suraj
>
>
> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>
> > P.S., BSPJob (with table input) is also the same. It's not only for
> > GraphJob.
> >
> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
> > wrote:
> > > All,
> > >
> > > I've also roughly described details about design of Graph APIs[1]. To
> > > reduce our misunderstandings (please read first Partitioning and
> > > GraphModuleInternals documents),
> > >
> > > * In the NoSQL case, there's obviously no need for hash-partitioning or
> > > rewriting partition files on HDFS. So, in these input cases, I think
> > > vertex structure should be parsed at GraphJobRunner.loadVertices()
> > > method.
> > >
> > > Here, we face two options: 1) The current implementation of
> > > 'PartitioningRunner' writes converted vertices on sequence format
> > > partition files. And GraphJobRunner reads only Vertex Writable
> > > objects.
> > > If input is a table, we maybe have to skip the Partitioning job
> > > and have to parse vertex structure at loadVertices() method after
> > > checking some conditions. 2) PartitioningRunner just writes raw
> > > records to proper partition files after checking its partition ID. And
> > > GraphJobRunner.loadVertices() always parses and loads vertices.
> > >
> > > I meant that I prefer the latter and there's no need to write
> > > VertexWritable files. It's not related to whether graph will support
> > > only Seq format or not. Hope my explanation is enough!
> > >
> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
> > >
> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]>
> > > wrote:
> > >> I've described my big picture here:
> > >> http://wiki.apache.org/hama/Partitioning
> > >>
> > >> Please review and give feedback on whether this is acceptable.
> > >>
> > >>
> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
> > >>> p.s., I think there's a misunderstanding. It doesn't mean that graph
> > >>> will support only sequence file format. The main point is whether
> > >>> converting happens at the partitioning stage or the loadVertices stage.
> > >>>
> > >>> Sent from my iPhone
> > >>>
> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
> > >>>
> > >>>> Sure, Please go ahead.
> > >>>>
> > >>>>
> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>>>> Please let me know before this is changed, I would like to work
> > >>>>>>> on a separate branch.
> > >>>>>
> > >>>>> I personally think we have to focus on high priority tasks, and on
> > >>>>> more feedback and contributions from users. So, if changes are
> > >>>>> made, I'll release periodically. If you want to work on another
> > >>>>> place, please do. I don't want to wait for your patches.
> > >>>>>
> > >>>>>
> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]>
> > >>>>> wrote:
> > >>>>>> For preparing integration with NoSQLs, of course, maybe a condition
> > >>>>>> check (whether converted or not) can be used without removing the
> > >>>>>> record converter.
> > >>>>>>
> > >>>>>> We need to discuss everything.
> > >>>>>>
> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]>
> > >>>>>> wrote:
> > >>>>>>> I am still -1 if this means our graph module can work only on
> > >>>>>>> sequential file format.
> > >>>>>>> Please note that you can set record converter to null and make
> > >>>>>>> changes to loadVertices for what you desire here.
> > >>>>>>>
> > >>>>>>> If we came to this design because TextInputFormat is inefficient,
> > >>>>>>> would this work for Avro or Thrift input format?
> > >>>>>>> Please let me know before this is changed, I would like to work
> > >>>>>>> on a separate branch.
> > >>>>>>> You may proceed as you wish.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Suraj
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> I think 'record converter' should be removed. It's not a good
> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep vertex input
> > >>>>>>>> reader, we can move related classes into common module.
> > >>>>>>>>
> > >>>>>>>> Let's go with my original plan.
> > >>>>>>>>
> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>> Hi all,
> > >>>>>>>>>
> > >>>>>>>>> I'm reading our old discussions about record converter,
> > >>>>>>>>> superstep injection, and common module:
> > >>>>>>>>>
> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
> > >>>>>>>>>
> > >>>>>>>>> To clarify goals and objectives:
> > >>>>>>>>>
> > >>>>>>>>> 1. A parallel input partition is necessary for obtaining
> > >>>>>>>>> scalability and elasticity of a Bulk Synchronous Parallel
> > >>>>>>>>> processing (It's not a memory issue, or Disk/Spilling Queue, or
> > >>>>>>>>> HAMA-644. Please don't shake).
> > >>>>>>>>> 2. Input partitioning should be handled at BSP framework level,
> > >>>>>>>>> and it is for every Hama job, not only for Graph jobs.
> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
> > >>>>>>>>> input also should be considered.
> > >>>>>>>>>
> > >>>>>>>>> The current problem is that every input of graph jobs should be
> > >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Best Regards, Edward J. Yoon
> > >>>>>>>>> @eddieyoon
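
For readers following the two options discussed in the thread, here is a rough sketch of option 2: the partitioning step only routes raw records to a partition by the hash of their key, and vertex parsing happens later in the load step. This is not Hama code. The class name RawPartitioningSketch, the Vertex type, the tab-separated record layout, and the in-memory lists standing in for partition files are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of option 2; none of these names are Hama APIs. */
class RawPartitioningSketch {

    /** Minimal stand-in for a parsed vertex: an ID plus outgoing edge targets. */
    static final class Vertex {
        final String id;
        final List<String> edges;

        Vertex(String id, List<String> edges) {
            this.id = id;
            this.edges = edges;
        }
    }

    /** Hash-partitions a vertex ID into the range [0, numPartitions). */
    static int partitionOf(String vertexId, int numPartitions) {
        return Math.floorMod(vertexId.hashCode(), numPartitions);
    }

    /**
     * Partitioning step: routes each raw record by its key only.
     * Records are assumed to look like "id<TAB>target1<TAB>target2...".
     * The inner lists stand in for per-partition files.
     */
    static List<List<String>> partitionRaw(List<String> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String record : records) {
            String key = record.split("\t", 2)[0];
            partitions.get(partitionOf(key, numPartitions)).add(record);
        }
        return partitions;
    }

    /** Load step: parses the raw records of one partition into vertices. */
    static List<Vertex> loadVertices(List<String> rawPartition) {
        List<Vertex> vertices = new ArrayList<>();
        for (String record : rawPartition) {
            String[] fields = record.split("\t");
            List<String> edges = new ArrayList<>();
            for (int i = 1; i < fields.length; i++) {
                edges.add(fields[i]);
            }
            vertices.add(new Vertex(fields[0], edges));
        }
        return vertices;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a\tb\tc", "b\tc", "c");
        List<List<String>> partitions = partitionRaw(records, 2);
        for (int i = 0; i < partitions.size(); i++) {
            List<Vertex> vertices = loadVertices(partitions.get(i));
            System.out.println("partition " + i + ": " + vertices.size() + " vertices");
        }
    }
}
```

With this split, an input that is already partitioned (the NoSQL/table case in the thread) could skip partitionRaw entirely and go straight to loadVertices, which is the property the thread argues for.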
