1. Graph/Matrix data is small, but Graph/Matrix algorithms require huge computations. Hence, the number of BSP processors should be adjustable (!= number of file blocks).
2. I'm -1 on using local disk to store partitions. HDFS is high cost, but reuse of partitions should be considered.

On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili <[email protected]> wrote:
> 2013/5/6 Suraj Menon <[email protected]>
>
>> I am assuming that the storage of vertices (NoSQL or any other format) need
>> not be updated after every iteration.
>>
>> Based on the above assumption, I have the following suggestions:
>>
>> - Instead of running a separate job, we inject a partitioning superstep
>> before the first superstep of the job. (This has a dependency on the
>> Superstep API)
>>
>
> could we do that without introducing that dependency? I mean would that
> work also if not using the Superstep API on the client side?
>
>> - The partitions instead of being written to HDFS, which is creating a copy
>> of input files in HDFS Cluster (too costly I believe), should be written to
>> local files and read from.
>>
>
> +1
>
>> - For graph jobs, we can configure this partitioning superstep class
>> specific to graph partitioning class that partitions and loads vertices.
>>
>
> this seems to be inline with the above assumption thus it probably makes
> sense.
>
>>
>> This sure has some dependencies. But would be a graceful solution and can
>> tackle every problem. This is what I want to achieve in the end. Please
>> proceed if you have any intermediate ways to reach here faster.
>>
>
> Your solution sounds good to me generally, better if we can avoid the
> dependency, but still ok if not.
> Let's collect also others' opinions and try to reach a shared consensus.
>
> Tommaso
>
>>
>> Regards,
>> Suraj
>>
>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>
>> > P.S., BSPJob (with table input) also the same. It's not only for GraphJob.
>> >
>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>> > > All,
>> > >
>> > > I've also roughly described details about design of Graph APIs[1].
>> > > To reduce our misunderstandings (please read first Partitioning and
>> > > GraphModuleInternals documents),
>> > >
>> > > * In NoSQLs case, there's obviously no need to Hash-partitioning or
>> > > rewrite partition files on HDFS. So, in these input cases, I think
>> > > vertex structure should be parsed at GraphJobRunner.loadVertices()
>> > > method.
>> > >
>> > > At here, we faced two options: 1) The current implementation of
>> > > 'PartitioningRunner' writes converted vertices on sequence format
>> > > partition files. And GraphJobRunner reads only Vertex Writable
>> > > objects. If input is table, we maybe have to skip the Partitioning job
>> > > and have to parse vertex structure at loadVertices() method after
>> > > checking some conditions. 2) PartitioningRunner just writes raw
>> > > records to proper partition files after checking its partition ID. And
>> > > GraphJobRunner.loadVertices() always parses and loads vertices.
>> > >
>> > > I was mean that I prefer the latter and there's no need to write
>> > > VertexWritable files. It's not related whether graph will support only
>> > > Seq format or not. Hope my explanation is enough!
>> > >
>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>> > >
>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>> > >> I've described my big picture here:
>> > >> http://wiki.apache.org/hama/Partitioning
>> > >>
>> > >> Please review and feedback whether this is acceptable.
>> > >>
>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>> > >>> p.s., i think theres mis understand. it doesn't mean that graph will
>> > >>> support only sequence file format. Main is whether converting at
>> > >>> patitioning stage or loadVertices stage.
>> > >>>
>> > >>> Sent from my iPhone
>> > >>>
>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>> > >>>
>> > >>>> Sure, Please go ahead.
>> > >>>>
>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>> > >>>>
>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>> > >>>>>>> separate branch.
>> > >>>>>
>> > >>>>> I personally, we have to focus on high priority tasks. and more
>> > >>>>> feedbacks and contributions from users. So, if changes made, I'll
>> > >>>>> release periodically. If you want to work on another place, please do.
>> > >>>>> I don't want to wait your patches.
>> > >>>>>
>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>> > >>>>>> For preparing integration with NoSQLs, of course, maybe condition
>> > >>>>>> check (whether converted or not) can be used without removing record
>> > >>>>>> converter.
>> > >>>>>>
>> > >>>>>> We need to discuss everything.
>> > >>>>>>
>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>> > >>>>>>> I am still -1 if this means our graph module can work only on sequential
>> > >>>>>>> file format.
>> > >>>>>>> Please note that you can set record converter to null and make changes to
>> > >>>>>>> loadVertices for what you desire here.
>> > >>>>>>>
>> > >>>>>>> If we came to this design, because TextInputFormat is inefficient, would
>> > >>>>>>> this work for Avro or Thrift input format?
>> > >>>>>>> Please let me know before this is changed, I would like to work on a
>> > >>>>>>> separate branch.
>> > >>>>>>> You may proceed as you wish.
>> > >>>>>>>
>> > >>>>>>> Regards,
>> > >>>>>>> Suraj
>> > >>>>>>>
>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>> > >>>>>>>
>> > >>>>>>>> I think 'record converter' should be removed. It's not good idea.
>> > >>>>>>>> Moreover, it's unnecessarily complex.
>> > >>>>>>>> To keep vertex input reader, we
>> > >>>>>>>> can move related classes into common module.
>> > >>>>>>>>
>> > >>>>>>>> Let's go with my original plan.
>> > >>>>>>>>
>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>> > >>>>>>>>> Hi all,
>> > >>>>>>>>>
>> > >>>>>>>>> I'm reading our old discussions about record converter, superstep
>> > >>>>>>>>> injection, and common module:
>> > >>>>>>>>>
>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>> > >>>>>>>>>
>> > >>>>>>>>> To clarify goals and objectives:
>> > >>>>>>>>>
>> > >>>>>>>>> 1. A parallel input partition is necessary for obtaining scalability
>> > >>>>>>>>> and elasticity of a Bulk Synchronous Parallel processing (It's not a
>> > >>>>>>>>> memory issue, or Disk/Spilling Queue, or HAMA-644. Please don't
>> > >>>>>>>>> shake).
>> > >>>>>>>>> 2. Input partitioning should be handled at BSP framework level, and it
>> > >>>>>>>>> is for every Hama jobs, not only for Graph jobs.
>> > >>>>>>>>> 3. Unnecessary I/O Overhead need to be avoided, and NoSQLs input also
>> > >>>>>>>>> should be considered.
>> > >>>>>>>>>
>> > >>>>>>>>> The current problem is that every input of graph jobs should be
>> > >>>>>>>>> rewritten on HDFS. If you have a good idea, Please let me know.
>> > >>>>>>>>>
>> > >>>>>>>>> --
>> > >>>>>>>>> Best Regards, Edward J. Yoon
>> > >>>>>>>>> @eddieyoon
>> > >>>>>>>>
>> > >>>>>>>> --
>> > >>>>>>>> Best Regards, Edward J. Yoon
>> > >>>>>>>> @eddieyoon
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> Best Regards, Edward J. Yoon
>> > >>>>>> @eddieyoon
>> > >>>>>
>> > >>>>> --
>> > >>>>> Best Regards, Edward J. Yoon
>> > >>>>> @eddieyoon
>> > >>>>>
>> > >>
>> > >> --
>> > >> Best Regards, Edward J. Yoon
>> > >> @eddieyoon
>> > >
>> > > --
>> > > Best Regards, Edward J. Yoon
>> > > @eddieyoon
>> >
>> > --
>> > Best Regards, Edward J. Yoon
>> > @eddieyoon
>> >
>>

--
Best Regards, Edward J. Yoon
@eddieyoon
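
For reference, option 2 from the quoted thread (the partitioning step only routes raw records by partition ID, and parsing is deferred to the load step) might look roughly like the sketch below. This is an illustration only, not Hama code: the function names, the tab-separated "vertex, neighbours..." record layout, and the CRC32-based hash are all assumptions made for the example. The number of tasks is a free parameter, independent of the number of input file blocks.

```python
import zlib
from collections import defaultdict


def partition_id(vertex_id: str, num_tasks: int) -> int:
    # Hypothetical hash partitioner. CRC32 is used because Python's
    # built-in hash() for strings is randomized per process.
    return zlib.crc32(vertex_id.encode("utf-8")) % num_tasks


def partition_raw_records(records, num_tasks):
    # Partitioning step: look only at the vertex ID (first field) and
    # pass the raw line through untouched, with no vertex conversion.
    partitions = defaultdict(list)
    for line in records:
        vertex_id = line.split("\t", 1)[0]
        partitions[partition_id(vertex_id, num_tasks)].append(line)
    return partitions


def load_vertices(raw_lines):
    # Load step: this is where raw records are always parsed into
    # a vertex -> neighbours mapping.
    vertices = {}
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        vertices[fields[0]] = fields[1:]
    return vertices


if __name__ == "__main__":
    records = ["a\tb\tc", "b\ta", "c\ta"]
    for pid, lines in sorted(partition_raw_records(records, 2).items()):
        print(pid, load_vertices(lines))
```

With this split, an input that is already partitioned (for example a table) could skip the first step and go straight to the load step, since both paths end in the same parser.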
