Keeping run-time (network-based) partitioning within GraphJobRunner is not a good idea.

>> - the number of splits found are not equal to the number of BSP tasks
>> configured for the job. OR

I have a question. If the input is unsorted map and I want to re-partition by hashing but the numbers of blocks and desired tasks are same, then what happens? Do you mean run-time partitioning?

On Wed, Jan 9, 2013 at 7:07 AM, Suraj Menon <[email protected]> wrote:
> Hi Apurv, yes, those are pending test cases to be fixed. GraphJobRunner is
> expecting the input in the format of Vertex, but we have input files as
> well as record key, values defined as Text. I have fixed only one unit test
> case yet.
>
> On Tue, Jan 8, 2013 at 4:45 PM, Apurv Verma <[email protected]> wrote:
>
>> Hey all,
>> I got the problem, the partitioner was not being set for the
>> PartitionerRunner bsp task. :P I have fixed the partitioner with portions
>> from your patch Suraj. Now after this commit partitioner will obey what you
>> specified earlier, just to recapitulate.
>>
>> Repartitioning is done if :
>> - the number of splits found are not equal to the number of BSP tasks
>> configured for the job. OR
>> - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
>> - user has specified a Runtime Partitioner class and enabled runtime
>> partitioning
>>
>> There was one special thing that I discovered about partitioner , just
>> sharing with you guys. Suppose I implement a partitioner which returns 0
>> for a record, then it isn't necessary that this record will go to peer with
>> index 0. It might go to peer 1. The only certitude which partitioner's
>> provide is that all records returning 0 will go to the same peer. I needed
>> partitioner to work for PrefixSum I was implementing.
>>
>> Things to do next.
>> 1) RecordConverter , which Suraj is implementing in HAMA-700. (Please
>> update Suraj)
>>
>> B.T.W there are problems in mvn test.
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.ArrayWritable
>>     at org.apache.hama.graph.GraphJobRunner.loadVertices(GraphJobRunner.java:287)
>>
>> I don't think my commit is breaking this.
>>
>> Thanks
>>
>> --
>> Regards,
>> Apurv Verma
>>
>> On Tue, Jan 8, 2013 at 11:07 PM, Suraj Menon <[email protected]> wrote:
>>
>> > Please explain the nature of problems you are facing with Partitioner?
>> >
>> > >Any reasons for deciding to move the
>> > > PartitioningJob inside BSPJobClient from BSPJob?
>> >
>> > Twofold, BSPJob was just a configuration holder object, didn't want to add
>> > the partitioning responsibility to the class.
>> > And also I wanted to know the number of splits, before taking the decision
>> > whether to repartition or not.
>> > Repartitioning is done if :
>> > - the number of splits found are not equal to the number of BSP tasks
>> > configured for the job. OR
>> > - the flag is set to true by the user ("bsp.input.runtime.partitioning") OR
>> > - user has specified a Runtime Partitioner class and enabled runtime
>> > partitioning
>> >
>> > Thanks,
>> > Suraj
>> >
>> > On Tue, Jan 8, 2013 at 11:31 AM, Apurv Verma <[email protected]> wrote:
>> >
>> > > Thanks, let me have a careful look at it. On a cursory look, I seem to
>> > > understand the basic idea. Any reasons for deciding to move the
>> > > PartitioningJob inside BSPJobClient from BSPJob?
>> > > BTW the current partitioner doesn't work as intended, only the default
>> > > partitioner HashPartitioner works fine, if I try to put some custom
>> > > partitioner there are problems.
>> > >
>> > > Let's resolve the partitioning completely before the spilling message
>> > > queue.
>> > >
>> > > --
>> > > Regards,
>> > > Apurv Verma
>> > >
>> > > On Tue, Jan 8, 2013 at 8:39 PM, Suraj Menon <[email protected]> wrote:
>> > >
>> > > > Hey Apurv, please check HAMA-700.patch_Jan7. Feel free to provide
>> > > > suggestions or even work on it.
>> > > >
>> > > > Thanks,
>> > > > Suraj
>> > > >
>> > > > On Tue, Jan 8, 2013 at 9:21 AM, Apurv Verma <[email protected]> wrote:
>> > > >
>> > > > > Hey Edward,
>> > > > > There was a compile bug which i fixed temporarily. isPartitioned was not
>> > > > > being initialized. Could you please check the last commit. I have currently
>> > > > > initialized it to false but I guess this should be configurable.
>> > > > > There was some jira where we wanted partitioning to be skipped if user
>> > > > > thinks his data is already partitioned.
>> > > > >
>> > > > > Thanks again.
>> > > > >
>> > > > > --
>> > > > > Regards,
>> > > > > Apurv Verma
>> > > > >
>> > > > > On Tue, Jan 8, 2013 at 3:44 PM, Edward J. Yoon <[email protected]> wrote:
>> > > > >
>> > > > > > Thanks, then I'll finish tomorrow. Please feel free to comment there.
>> > > > > >
>> > > > > > On Tue, Jan 8, 2013 at 7:04 PM, Tommaso Teofili <[email protected]> wrote:
>> > > > > > > thanks Edward, it looks good.
>> > > > > > > Tommaso
>> > > > > > >
>> > > > > > > 2013/1/8 Edward J. Yoon <[email protected]>
>> > > > > > >
>> > > > > > >> Please review this:
>> > > > > > >>
>> > > > > > >> http://wiki.apache.org/hama/Partitioning
>> > > > > > >>
>> > > > > > >> On Mon, Jan 7, 2013 at 6:17 AM, Edward J. Yoon <[email protected]> wrote:
>> > > > > > >> > I mean, the pre-partitioning or resizing partitions is really important.
>> > > > > > >> >
>> > > > > > >> > On Mon, Jan 7, 2013 at 6:15 AM, Edward J. Yoon <[email protected]> wrote:
>> > > > > > >> >> This is another talk ...
>> > > > > > >> >>
>> > > > > > >> >> Unlike MapReduce, I think, Hama BSP will handle tasks that input is
>> > > > > > >> >> small in size but large in computational complexity, such as graph,
>> > > > > > >> >> sparse matrix, machine learning algorithms.
>> > > > > > >> >>
>> > > > > > >> >> On Mon, Jan 7, 2013 at 5:57 AM, Edward J. Yoon <[email protected]> wrote:
>> > > > > > >> >>> Even though the numbers of splits and tasks are the same, user-defined
>> > > > > > >> >>> partitioning job should be run (because it is not only for resizing
>> > > > > > >> >>> partitions. For example, range partitioning of unsorted data set or
>> > > > > > >> >>> hash key partitioning, ..., etc).
>> > > > > > >> >>>
>> > > > > > >> >>> On Mon, Jan 7, 2013 at 5:28 AM, Suraj Menon <[email protected]> wrote:
>> > > > > > >> >>>>> 1. I am referring to org.apache.hama.bsp.PartitioningRunner, it's named
>> > > > > > >> >>>>> as so in the HEAD (1429573) of trunk. It isn't removed but it isn't
>> > > > > > >> >>>>> referred to anywhere else. I can't find any references to it in the
>> > > > > > >> >>>>> workspace.
>> > > > > > >> >>>>
>> > > > > > >> >>>> It is referred in BSPJob#waitForCompletion function as a separate BSP job
>> > > > > > >> >>>> to create the specified splits.
>> > > > > > >> >>>>
>> > > > > > >> >>>>> 2. job.setPartitioner is the same as setting
>> > > > > > >> >>>>> "bsp.input.partitioner.class" . Anyways , So acc. to me partitions are
>> > > > > > >> >>>>> not being created because of which the following happens.
>> > > > > > >> >>>>> If I am running the task on local fs and not hdfs, there's just one
>> > > > > > >> >>>>> input split and even if I set a partitioner to create two partitions and
>> > > > > > >> >>>>> set bsp.setNumTasks(2) , this is overriden and only one task is
>> > > > > > >> >>>>> executed.
>> > > > > > >> >>>>> See BSPJobClient#submitJobInternal()
>> > > > > > >> >>>>> where it does the following
>> > > > > > >> >>>>> job.setNumBspTask(writeSplits(job, submitSplitFile, maxTasks)); Line
>> > > > > > >> >>>>> 326.
>> > > > > > >> >>>>
>> > > > > > >> >>>> This job is set to run if the number of splits != number of Tasks or if
>> > > > > > >> >>>> forced by the configuration. I can share my HAMA-700 current state of patch
>> > > > > > >> >>>> with you.
>> > > > > > >> >>>>
>> > > > > > >> >>>>> 3. So here is what I think is happening, Partitioner is not in the
>> > > > > > >> >>>>> codepath (try putting a breakpoint inside the partitioner and executing
>> > > > > > >> >>>>> and non graph bsp task), so partitions are not being created and
>> > > > > > >> >>>>> writeSplits() is returning 1.
>> > > > > > >> >>>>> [ writeSplits() returns the number of splits in the input. ]
>> > > > > > >> >>>>
>> > > > > > >> >>>> Probably because it is running as a separate process?
>> > > > > > >> >>>
>> > > > > > >> >>> --
>> > > > > > >> >>> Best Regards, Edward J. Yoon
>> > > > > > >> >>> @eddieyoon
>> > > > > > >> >>
>> > > > > > >> >> --
>> > > > > > >> >> Best Regards, Edward J. Yoon
>> > > > > > >> >> @eddieyoon
>> > > > > > >> >
>> > > > > > >> > --
>> > > > > > >> > Best Regards, Edward J. Yoon
>> > > > > > >> > @eddieyoon
>> > > > > > >>
>> > > > > > >> --
>> > > > > > >> Best Regards, Edward J. Yoon
>> > > > > > >> @eddieyoon
>> > > > > >
>> > > > > > --
>> > > > > > Best Regards, Edward J. Yoon
>> > > > > > @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
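
For reference, the three repartitioning conditions quoted in this thread can be written out as a small check. This is only a sketch under stated assumptions, not the HAMA-700 code: the class `RepartitionCheck` and the method `shouldRepartition` are hypothetical names, a plain `java.util.Properties` stands in for the job configuration, and it assumes the "Runtime Partitioner class" is the one set under "bsp.input.partitioner.class" (the thread does not say which key is meant).

```java
import java.util.Properties;

// Hypothetical helper, not part of Hama; it only restates the quoted conditions.
class RepartitionCheck {
    // Both keys are quoted in the thread; how Hama reads them is not shown there.
    static final String RUNTIME_PARTITIONING = "bsp.input.runtime.partitioning";
    // Assumption: the runtime partitioner class is configured under this key.
    static final String PARTITIONER_CLASS = "bsp.input.partitioner.class";

    static boolean shouldRepartition(int numSplits, int numBspTasks, Properties conf) {
        // Condition 1: the splits found do not match the configured BSP tasks.
        boolean splitsMismatch = numSplits != numBspTasks;
        // Condition 2: the user set the runtime partitioning flag to true.
        boolean flagSet = Boolean.parseBoolean(conf.getProperty(RUNTIME_PARTITIONING, "false"));
        // Condition 3: a partitioner class is given and runtime partitioning is enabled.
        boolean customPartitioner = conf.getProperty(PARTITIONER_CLASS) != null && flagSet;
        return splitsMismatch || flagSet || customPartitioner;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldRepartition(1, 2, conf)); // true: one split, two tasks
        System.out.println(shouldRepartition(2, 2, conf)); // false: counts match, no flag
        conf.setProperty(RUNTIME_PARTITIONING, "true");
        System.out.println(shouldRepartition(2, 2, conf)); // true: forced by the flag
    }
}
```

Read this way, the case asked about at the top of this mail (unsorted input, hash re-partitioning wanted, block and task counts equal) would be repartitioned only when the flag is set explicitly. Whether the actual patch behaves like that is for the patch authors to confirm.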
