I think this is an important step forward, but let's close this discussion by lazy consensus if no one objects within the next three days.

On Tue, May 7, 2013 at 11:26 AM, Edward J. Yoon <[email protected]> wrote:
> I've noted Suraj's suggestion and added my opinions, too -
> http://wiki.apache.org/hama/Partitioning
>
> In this thread, please focus on the problem of integration with
> NoSQLs. Since PartitioningRunner converts records of input data, and
> GraphJobRunner reads converted records from partition files, Table
> input must go unnecessarily through PartitioningRunner. That's the
> problem of current "Partitioning and Record converter".
>
>
> On Tue, May 7, 2013 at 9:50 AM, Edward J. Yoon <[email protected]> wrote:
>> And, using of superstep API is a improvement or approach of partition
>> processing. So, the main is whether we will parse vertex at bsp core
>> or graph job runner. Please don't shake.
>>
>> On Tue, May 7, 2013 at 9:45 AM, Edward J. Yoon <[email protected]> wrote:
>>> Hello all,
>>>
>>> a GSoC student who want to try to integrate NoSQLs with Graph looking
>>> at this thread. My suggestion is not a quick fix solution. It's a
>>> must. Please let me know whether you understand my suggestion or not.
>>>
>>> On Tue, May 7, 2013 at 9:38 AM, Edward J. Yoon <[email protected]>
>>> wrote:
>>>> Do you need also a separated Wiki? :-) If not, please feel free to
>>>> describe your ideas on Wiki, dividing short-term/long-term plans.
>>>>
>>>> On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <[email protected]>
>>>> wrote:
>>>>> 1. Graph/Matrix data is small but Graph/Matrix algo requires huge
>>>>> computations. Hence, the number of BSP processors should be able to
>>>>> adjust ( != file blocks).
>>>>>
>>>>> 2. I'm -1 for using local disk to store partitions. HDFS is high cost.
>>>>> But, reuse of partitions should be considered.
>>>>>
>>>>> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
>>>>> <[email protected]> wrote:
>>>>>> 2013/5/6 Suraj Menon <[email protected]>
>>>>>>
>>>>>>> I am assuming that the storage of vertices (NoSQL or any other
>>>>>>> format) need not be updated after every iteration.
>>>>>>>
>>>>>>> Based on the above assumption, I have the following suggestions:
>>>>>>>
>>>>>>> - Instead of running a separate job, we inject a partitioning
>>>>>>> superstep before the first superstep of the job. (This has a
>>>>>>> dependency on the Superstep API)
>>>>>>>
>>>>>>
>>>>>> could we do that without introducing that dependency? I mean would that
>>>>>> work also if not using the Superstep API on the client side?
>>>>>>
>>>>>>
>>>>>>> - The partitions instead of being written to HDFS, which is creating a
>>>>>>> copy of input files in HDFS Cluster (too costly I believe), should be
>>>>>>> written to local files and read from.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>> - For graph jobs, we can configure this partitioning superstep class
>>>>>>> specific to graph partitioning class that partitions and loads vertices.
>>>>>>>
>>>>>>
>>>>>> this seems to be inline with the above assumption thus it probably makes
>>>>>> sense.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> This sure has some dependencies. But would be a graceful solution and
>>>>>>> can tackle every problem. This is what I want to achieve in the end.
>>>>>>> Please proceed if you have any intermediate ways to reach here faster.
>>>>>>>
>>>>>>
>>>>>> Your solution sounds good to me generally, better if we can avoid the
>>>>>> dependency, but still ok if not.
>>>>>> Let's collect also others' opinions and try to reach a shared consensus.
>>>>>>
>>>>>> Tommaso
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Suraj
>>>>>>>
>>>>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > P.S., BSPJob (with table input) also the same. It's not only for
>>>>>>> > GraphJob.
>>>>>>> >
>>>>>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
>>>>>>> > wrote:
>>>>>>> > > All,
>>>>>>> > >
>>>>>>> > > I've also roughly described details about design of Graph APIs[1].
>>>>>>> > > To reduce our misunderstandings (please read first Partitioning and
>>>>>>> > > GraphModuleInternals documents),
>>>>>>> > >
>>>>>>> > > * In NoSQLs case, there's obviously no need to Hash-partitioning or
>>>>>>> > > rewrite partition files on HDFS. So, in these input cases, I think
>>>>>>> > > vertex structure should be parsed at GraphJobRunner.loadVertices()
>>>>>>> > > method.
>>>>>>> > >
>>>>>>> > > At here, we faced two options: 1) The current implementation of
>>>>>>> > > 'PartitioningRunner' writes converted vertices on sequence format
>>>>>>> > > partition files. And GraphJobRunner reads only Vertex Writable
>>>>>>> > > objects. If input is table, we maybe have to skip the Partitioning
>>>>>>> > > job and have to parse vertex structure at loadVertices() method after
>>>>>>> > > checking some conditions. 2) PartitioningRunner just writes raw
>>>>>>> > > records to proper partition files after checking its partition ID.
>>>>>>> > > And GraphJobRunner.loadVertices() always parses and loads vertices.
>>>>>>> > >
>>>>>>> > > I was mean that I prefer the latter and there's no need to write
>>>>>>> > > VertexWritable files. It's not related whether graph will support
>>>>>>> > > only Seq format or not. Hope my explanation is enough!
>>>>>>> > >
>>>>>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>>>>> > >
>>>>>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon
>>>>>>> > > <[email protected]> wrote:
>>>>>>> > >> I've described my big picture here:
>>>>>>> > >> http://wiki.apache.org/hama/Partitioning
>>>>>>> > >>
>>>>>>> > >> Please review and feedback whether this is acceptable.
>>>>>>> > >>
>>>>>>> > >>
>>>>>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>>>>> > >>> p.s., i think theres mis understand. it doesn't mean that graph
>>>>>>> > >>> will support only sequence file format. Main is whether converting
>>>>>>> > >>> at patitioning stage or loadVertices stage.
>>>>>>> > >>>
>>>>>>> > >>> Sent from my iPhone
>>>>>>> > >>>
>>>>>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>>>>> > >>>
>>>>>>> > >>>> Sure, Please go ahead.
>>>>>>> > >>>>
>>>>>>> > >>>>
>>>>>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]>
>>>>>>> > >>>> wrote:
>>>>>>> > >>>>
>>>>>>> > >>>>>>> Please let me know before this is changed, I would like to
>>>>>>> > >>>>>>> work on a separate branch.
>>>>>>> > >>>>>
>>>>>>> > >>>>> I personally, we have to focus on high priority tasks. and more
>>>>>>> > >>>>> feedbacks and contributions from users. So, if changes made,
>>>>>>> > >>>>> I'll release periodically. If you want to work on another place,
>>>>>>> > >>>>> please do. I don't want to wait your patches.
>>>>>>> > >>>>>
>>>>>>> > >>>>>
>>>>>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]>
>>>>>>> > >>>>> wrote:
>>>>>>> > >>>>>> For preparing integration with NoSQLs, of course, maybe
>>>>>>> > >>>>>> condition check (whether converted or not) can be used without
>>>>>>> > >>>>>> removing record converter.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> We need to discuss everything.
>>>>>>> > >>>>>>
>>>>>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]>
>>>>>>> > >>>>>> wrote:
>>>>>>> > >>>>>>> I am still -1 if this means our graph module can work only on
>>>>>>> > >>>>>>> sequential file format.
>>>>>>> > >>>>>>> Please note that you can set record converter to null and make
>>>>>>> > >>>>>>> changes to loadVertices for what you desire here.
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> If we came to this design, because TextInputFormat is
>>>>>>> > >>>>>>> inefficient, would this work for Avro or Thrift input format?
>>>>>>> > >>>>>>> Please let me know before this is changed, I would like to
>>>>>>> > >>>>>>> work on a separate branch.
>>>>>>> > >>>>>>> You may proceed as you wish.
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> Regards,
>>>>>>> > >>>>>>> Suraj
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
>>>>>>> > >>>>>>> wrote:
>>>>>>> > >>>>>>>
>>>>>>> > >>>>>>>> I think 'record converter' should be removed. It's not good
>>>>>>> > >>>>>>>> idea. Moreover, it's unnecessarily complex. To keep vertex
>>>>>>> > >>>>>>>> input reader, we can move related classes into common module.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> Let's go with my original plan.
>>>>>>> > >>>>>>>>
>>>>>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]>
>>>>>>> > >>>>>>>> wrote:
>>>>>>> > >>>>>>>>> Hi all,
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> I'm reading our old discussions about record converter,
>>>>>>> > >>>>>>>>> superstep injection, and common module:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> To clarify goals and objectives:
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> 1. A parallel input partition is necessary for obtaining
>>>>>>> > >>>>>>>>> scalability and elasticity of a Bulk Synchronous Parallel
>>>>>>> > >>>>>>>>> processing (It's not a memory issue, or Disk/Spilling Queue,
>>>>>>> > >>>>>>>>> or HAMA-644. Please don't shake).
>>>>>>> > >>>>>>>>> 2. Input partitioning should be handled at BSP framework
>>>>>>> > >>>>>>>>> level, and it is for every Hama jobs, not only for Graph jobs.
>>>>>>> > >>>>>>>>> 3. Unnecessary I/O Overhead need to be avoided, and NoSQLs
>>>>>>> > >>>>>>>>> input also should be considered.
>>>>>>> > >>>>>>>>>
>>>>>>> > >>>>>>>>> The current problem is that every input of graph jobs
>>>>>>> > >>>>>>>>> should be rewritten on HDFS. If you have a good idea, Please
>>>>>>> > >>>>>>>>> let me know.

--
Best Regards, Edward J. Yoon
@eddieyoon
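
P.S. To make option 2 from the quoted thread concrete, here is a rough sketch of the idea: the partitioning stage only routes raw records by partition ID, and parsing into vertices happens at load time. This is not Hama code. It is written in Python for brevity, and every name in it (partition_id, partition_raw_records, load_vertices), the tab-separated adjacency-list record format, and the CRC32 hash are assumptions made purely for illustration.

```python
import zlib
from collections import defaultdict


def partition_id(vertex_id, num_partitions):
    """Deterministic hash partitioning on the vertex ID.

    CRC32 is an arbitrary stand-in for whatever partitioner a job configures.
    """
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def partition_raw_records(records, num_partitions):
    """Partitioning stage: route each raw record without converting it.

    Only the key (the text before the first tab) is inspected; the record
    itself is stored unchanged in its partition.
    """
    partitions = defaultdict(list)
    for line in records:
        vertex_id = line.split("\t", 1)[0]
        partitions[partition_id(vertex_id, num_partitions)].append(line)
    return partitions


def load_vertices(raw_records):
    """Load stage: always parse raw records into vertex -> neighbours.

    Assumed record format: "vertex<TAB>neighbour1<TAB>neighbour2...".
    """
    vertices = {}
    for line in raw_records:
        fields = line.rstrip("\n").split("\t")
        vertices[fields[0]] = fields[1:]
    return vertices


if __name__ == "__main__":
    records = ["a\tb\tc", "b\ta", "c\ta"]
    for pid, lines in sorted(partition_raw_records(records, 2).items()):
        print(pid, load_vertices(lines))
```

Under this split, an input that is already partitioned (for example a table) could skip partition_raw_records and hand its records straight to load_vertices, which is the point of parsing at the load stage.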
