Hello all, a GSoC student who wants to try integrating NoSQLs with Graph is looking at this thread. My suggestion is not a quick-fix solution; it's a must. Please let me know whether or not you understand my suggestion.
On Tue, May 7, 2013 at 9:38 AM, Edward J. Yoon <[email protected]> wrote:
> Do you need also a separated Wiki? :-) If not, please feel free to
> describe your ideas on Wiki, dividing short-term/long-term plans.
>
> On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <[email protected]> wrote:
>> 1. Graph/Matrix data is small but Graph/Matrix algo requires huge
>> computations. Hence, the number of BSP processors should be able to
>> adjust ( != file blocks).
>>
>> 2. I'm -1 for using local disk to store partitions. HDFS is high cost.
>> But, reuse of partitions should be considered.
>>
>> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
>> <[email protected]> wrote:
>>> 2013/5/6 Suraj Menon <[email protected]>
>>>
>>>> I am assuming that the storage of vertices (NoSQL or any other format)
>>>> need not be updated after every iteration.
>>>>
>>>> Based on the above assumption, I have the following suggestions:
>>>>
>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>> before the first superstep of the job. (This has a dependency on the
>>>> Superstep API)
>>>
>>> could we do that without introducing that dependency? I mean would that
>>> work also if not using the Superstep API on the client side?
>>>
>>>> - The partitions, instead of being written to HDFS, which is creating a
>>>> copy of input files in the HDFS cluster (too costly I believe), should
>>>> be written to local files and read from there.
>>>
>>> +1
>>>
>>>> - For graph jobs, we can configure this partitioning superstep class
>>>> specific to a graph partitioning class that partitions and loads vertices.
>>>
>>> this seems to be in line with the above assumption thus it probably makes
>>> sense.
>>>
>>>> This sure has some dependencies. But it would be a graceful solution and
>>>> can tackle every problem. This is what I want to achieve in the end.
>>>> Please proceed if you have any intermediate ways to reach here faster.
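[Editorial sketch: the injected partitioning superstep proposed above could look roughly like the following. All names here (`getPartitionId`, `partition`, the tab-separated record layout) are illustrative assumptions, not Hama's actual Superstep API.]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: before the first "real" superstep, each peer scans its input
// split, computes a partition ID per raw record, and routes the record to
// the peer that owns that partition. No separate partitioning job is run.
public class PartitioningSketch {

    // Hash partitioning by vertex ID, as the partitioning superstep would do.
    static int getPartitionId(String vertexId, int numPeers) {
        return Math.abs(vertexId.hashCode() % numPeers);
    }

    // Group raw input records by the peer that should own them; in a real
    // BSP job these groups would be sent as messages, not kept in a map.
    static Map<Integer, List<String>> partition(List<String> rawRecords, int numPeers) {
        Map<Integer, List<String>> byPeer = new HashMap<>();
        for (String record : rawRecords) {
            // assume the vertex ID is the first tab-separated field
            String vertexId = record.split("\t", 2)[0];
            int pid = getPartitionId(vertexId, numPeers);
            byPeer.computeIfAbsent(pid, k -> new ArrayList<>()).add(record);
        }
        return byPeer;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a\tb c", "b\tc", "c\ta");
        Map<Integer, List<String>> parts = partition(input, 2);
        int total = parts.values().stream().mapToInt(List::size).sum();
        System.out.println("records routed: " + total); // prints "records routed: 3"
    }
}
```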
>>>
>>> Your solution sounds good to me generally, better if we can avoid the
>>> dependency, but still ok if not.
>>> Let's collect also others' opinions and try to reach a shared consensus.
>>>
>>> Tommaso
>>>
>>>> Regards,
>>>> Suraj
>>>>
>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > P.S., BSPJob (with table input) is also the same. It's not only for GraphJob.
>>>> >
>>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>> > > All,
>>>> > >
>>>> > > I've also roughly described details about the design of the Graph APIs[1].
>>>> > > To reduce our misunderstandings (please read first the Partitioning and
>>>> > > GraphModuleInternals documents),
>>>> > >
>>>> > > * In the NoSQLs case, there's obviously no need to hash-partition or
>>>> > > rewrite partition files on HDFS. So, for these inputs, I think the
>>>> > > vertex structure should be parsed in the GraphJobRunner.loadVertices()
>>>> > > method.
>>>> > >
>>>> > > Here, we face two options: 1) The current implementation of
>>>> > > 'PartitioningRunner' writes converted vertices to sequence-format
>>>> > > partition files, and GraphJobRunner reads only VertexWritable
>>>> > > objects. If the input is a table, we maybe have to skip the partitioning
>>>> > > job and parse the vertex structure in the loadVertices() method after
>>>> > > checking some conditions. 2) PartitioningRunner just writes raw
>>>> > > records to the proper partition files after checking their partition ID,
>>>> > > and GraphJobRunner.loadVertices() always parses and loads vertices.
>>>> > >
>>>> > > I meant that I prefer the latter and that there's no need to write
>>>> > > VertexWritable files. It's not related to whether graph will support
>>>> > > only Seq format or not. Hope my explanation is enough!
>>>> > >
>>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>> > >
>>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > >> I've described my big picture here: http://wiki.apache.org/hama/Partitioning
>>>> > >>
>>>> > >> Please review and give feedback on whether this is acceptable.
>>>> > >>
>>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>> > >>> P.S., I think there's a misunderstanding. It doesn't mean that graph
>>>> > >>> will support only the sequence file format. The main question is whether
>>>> > >>> to convert at the partitioning stage or the loadVertices stage.
>>>> > >>>
>>>> > >>> Sent from my iPhone
>>>> > >>>
>>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>> > >>>
>>>> > >>>> Sure, please go ahead.
>>>> > >>>>
>>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>>>> > >>>>>>> separate branch.
>>>> > >>>>>
>>>> > >>>>> Personally, I think we have to focus on high-priority tasks, and on
>>>> > >>>>> more feedback and contributions from users. So, if changes are made,
>>>> > >>>>> I'll release periodically. If you want to work in another place,
>>>> > >>>>> please do. I don't want to wait for your patches.
>>>> > >>>>>
>>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>> In preparing the integration with NoSQLs, of course, maybe a condition
>>>> > >>>>>> check (whether converted or not) can be used without removing the
>>>> > >>>>>> record converter.
>>>> > >>>>>>
>>>> > >>>>>> We need to discuss everything.
>>>> > >>>>>>
>>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>>> > >>>>>>> I am still -1 if this means our graph module can work only on the
>>>> > >>>>>>> sequential file format.
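[Editorial sketch: "option 2" above, where the partitioning step writes raw, unparsed records into per-partition files and loadVertices() parses them later, could look roughly like this. The file layout (`part-N`), tab-separated record format, and method names are illustrative assumptions, not Hama's actual implementation.]

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RawRecordPartitioner {

    static int partitionId(String vertexId, int numPartitions) {
        return Math.abs(vertexId.hashCode() % numPartitions);
    }

    // Option 2: write each *raw* record, unparsed, into its partition's file.
    static void writePartitions(List<String> records, int numPartitions, Path dir) {
        try {
            Files.createDirectories(dir);
            for (String record : records) {
                String vertexId = record.split("\t", 2)[0];
                Path file = dir.resolve("part-" + partitionId(vertexId, numPartitions));
                Files.writeString(file, record + "\n",
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What loadVertices() would then do per partition file: parse raw lines
    // into vertexId -> outgoing edges; no VertexWritable files are needed.
    static Map<String, List<String>> loadVertices(Path file) {
        try {
            Map<String, List<String>> vertices = new LinkedHashMap<>();
            for (String line : Files.readAllLines(file)) {
                String[] f = line.split("\t", 2);
                List<String> edges = f.length > 1 ? Arrays.asList(f[1].split(" ")) : List.of();
                vertices.put(f[0], edges);
            }
            return vertices;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hama-parts");
        writePartitions(List.of("a\tb c", "b\tc", "c\ta"), 2, dir);
        int loaded = 0;
        for (int p = 0; p < 2; p++) {
            Path f = dir.resolve("part-" + p);
            if (Files.exists(f)) loaded += loadVertices(f).size();
        }
        System.out.println("vertices loaded: " + loaded); // prints "vertices loaded: 3"
    }
}
```

Writing raw records avoids serializing VertexWritable objects twice, at the cost of re-parsing on every load; that trade-off is exactly what the two options differ on.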
>>>> > >>>>>>> Please note that you can set the record converter to null and make
>>>> > >>>>>>> changes to loadVertices for what you desire here.
>>>> > >>>>>>>
>>>> > >>>>>>> If we came to this design because TextInputFormat is inefficient,
>>>> > >>>>>>> would this work for the Avro or Thrift input formats?
>>>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>>>> > >>>>>>> separate branch.
>>>> > >>>>>>> You may proceed as you wish.
>>>> > >>>>>>>
>>>> > >>>>>>> Regards,
>>>> > >>>>>>> Suraj
>>>> > >>>>>>>
>>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>>>> I think the 'record converter' should be removed. It's not a good
>>>> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep the vertex input
>>>> > >>>>>>>> reader, we can move the related classes into the common module.
>>>> > >>>>>>>>
>>>> > >>>>>>>> Let's go with my original plan.
>>>> > >>>>>>>>
>>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>>> > >>>>>>>>> Hi all,
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> I'm reading our old discussions about the record converter,
>>>> > >>>>>>>>> superstep injection, and the common module:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> To clarify goals and objectives:
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> 1. Parallel input partitioning is necessary for obtaining the
>>>> > >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel processing
>>>> > >>>>>>>>> (it's not a memory issue, or Disk/Spilling Queue, or HAMA-644.
>>>> > >>>>>>>>> Please don't shake).
>>>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework level,
>>>> > >>>>>>>>> and it is for every Hama job, not only for Graph jobs.
>>>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQLs input
>>>> > >>>>>>>>> also should be considered.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> The current problem is that every input of graph jobs should be
>>>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
>>>> > >>>>>>>>>
>>>> > >>>>>>>>> --
>>>> > >>>>>>>>> Best Regards, Edward J. Yoon
>>>> > >>>>>>>>> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
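[Editorial sketch: the condition check discussed in this thread, skipping the pre-partitioning phase entirely when the input source can already be read partition-wise (e.g. a NoSQL table with its own key ranges), could be as simple as the following. The enum and the rule are illustrative assumptions, not Hama configuration keys.]

```java
// Decide whether a job needs the partitioning phase at all. NoSQL tables
// are range-partitioned by the store itself, so rewriting them into
// partition files on HDFS would be pure I/O overhead.
public class PartitioningPolicy {

    enum InputSource { SEQUENCE_FILE, TEXT_FILE, NOSQL_TABLE }

    static boolean needsPartitioningPhase(InputSource source) {
        return source != InputSource.NOSQL_TABLE;
    }

    public static void main(String[] args) {
        for (InputSource s : InputSource.values()) {
            System.out.println(s + " -> partitioning phase: " + needsPartitioningPhase(s));
        }
    }
}
```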
