Re: [#131] Supporting Hadoop data and cluster management
On 22 May 2015, at 3:26, Efi wrote:

> Thank you for the recursive tag check, Steven told me about it yesterday as well. I hadn't thought of it so far, but I will think of ways to implement it for these methods so it does not create problems.
>
> My question was not exactly that; I was considering whether the query engine could parse data that contain complete elements but are missing other tags from enclosing elements. For example, the data that come from either of these methods can look like this:
>
>     <books>
>       <book> ... </book>
>
> And another one like this:
>
>       <book> ... </book>
>     </books>
>
> The query is about data inside the element <book>; will these work with the query engine?

I would hope so. I assume that everything before the first <book> and between a </book> and the next <book> should be ignored, and that everything between a <book> and a </book> is parsed and passed to the query engine. Does that make sense?

> About your answer for the scenario where a block does not contain the tags in question, it can mean two things. Either it is not part of the element we want to work with, so we simply ignore it, or it is part of the element but the starting and ending tags are in previous/next blocks, so the block contains only part of the body that we want. In that case it will be parsed only by the reader that is assigned to the block containing the starting tag of this element.

Yes, that sounds right.

> On that note, I am currently working on a way to assign only one reader to each block, because HDFS assigns readers according to the available CPU cores. That means the same block can be assigned to more than one reader, and in our case that can lead to memory problems.

I'm not sure I fully understand the current design. Could you explain in a little more detail in which case you see which problem coming up (I can imagine a number of problems with memory ...)?
Cheers,
Till
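To make the question about fragments concrete: a standalone XML parser (here the JDK's DocumentBuilder; the class and method names below are mine, not from the project) accepts a complete <book> element on its own, but rejects a fragment that still carries an unmatched <books> or stray </books> from the enclosing document. So a record reader has to emit only the complete elements and drop the surrounding remnants. A minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class FragmentCheck {
    // Returns true iff the string parses as a complete, well-formed XML document.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // Any well-formedness violation surfaces as a SAXException here
            // (the JDK parser may also print a [Fatal Error] line to stderr).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<book><title>T</title></book>")); // complete element: true
        System.out.println(parses("<books><book>T</book>"));         // unmatched <books>: false
        System.out.println(parses("<book>T</book></books>"));        // stray </books>: false
    }
}
```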
Re: [#131] Supporting Hadoop data and cluster management
Thank you for the recursive tag check, Steven told me about it yesterday as well. I hadn't thought of it so far, but I will think of ways to implement it for these methods so it does not create problems.

My question was not exactly that; I was considering whether the query engine could parse data that contain complete elements but are missing other tags from enclosing elements. For example, the data that come from either of these methods can look like this:

    <books>
      <book> ... </book>

And another one like this:

      <book> ... </book>
    </books>

The query is about data inside the element <book>; will these work with the query engine?

About your answer for the scenario where a block does not contain the tags in question, it can mean two things. Either it is not part of the element we want to work with, so we simply ignore it, or it is part of the element but the starting and ending tags are in previous/next blocks, so the block contains only part of the body that we want. In that case it will be parsed only by the reader that is assigned to the block containing the starting tag of this element.

On that note, I am currently working on a way to assign only one reader to each block, because HDFS assigns readers according to the available CPU cores. That means the same block can be assigned to more than one reader, and in our case that can lead to memory problems.

Efi
Re: [#131] Supporting Hadoop data and cluster management
This seems correct to me. Since our objective in implementing HDFS support is to deal with very large XML files, I think we should avoid any size limitations. Regarding the tags, does anyone have any thoughts? In the case of searching for all elements with a given name regardless of depth, this method will work fine, but if we want a specific path, we could end up opening lots of blocks to guarantee path correctness - the entire file, in fact.

Steven
Re: [#131] Supporting Hadoop data and cluster management
(1) I agree that [1] looks better (thanks for the diagrams - we should add them to the docs!).

(2) I think that it’s ok to have the restriction that the given tag (a) identifies the root element of the elements that we want to work with and (b) is not used recursively (and I would check this condition and fail if it doesn’t hold). If we have a few really big nodes in the file, we have no way to process them in parallel anyway, so the chosen tags should split the document into a large number of smaller pieces for VXQuery to work well.

Wrt. the question of what happens if we start reading a block that does not contain the tag(s) in question (I think that that’s the last question - please correct me if I’m wrong): it would probably be read without producing any nodes that will be processed by the query engine. So the effort to do that would be wasted, but I would expect that the block would then be parsed again as the continuation of another block that contained a start tag.

Till
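The recursion check suggested above could be as simple as a one-pass depth count over occurrences of the chosen tag. A toy sketch (class and method names are mine; it deliberately ignores CDATA sections, comments, and self-closing tags):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecursiveTagCheck {
    // Returns true if <tag> ever appears nested inside itself.
    static boolean isUsedRecursively(String xml, String tag) {
        // Matches <tag>, <tag attr="...">, and </tag>; skips <tag/>.
        Matcher m = Pattern.compile("</?" + Pattern.quote(tag) + "(\\s[^>]*)?>").matcher(xml);
        int depth = 0;
        while (m.find()) {
            if (m.group().startsWith("</")) {
                depth--;
            } else {
                depth++;
                if (depth > 1) return true;   // an opening tag inside an open element
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isUsedRecursively("<books><book>a</book><book>b</book></books>", "book")); // false
        System.out.println(isUsedRecursively("<book>outer<book>inner</book></book>", "book"));        // true
    }
}
```

The input format could run this check while splitting and fail fast, as Till proposes, instead of silently producing wrong fragments.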
Re: [#131] Supporting Hadoop data and cluster management
Hello everyone,

For this week, the two different methods for reading complete items according to a specific tag are completed and tested in a standalone HDFS deployment. In detail, what each method does:

The first method, which I call the One Buffer Method, reads a block, saves it in a buffer, and continues reading from the following blocks until it finds the specific closing tag. It shows good results and good times in the tests.

The second method, called the Shared File Method, reads only the complete items contained in the block; the incomplete items from the start and end of the block are sent to a shared file in the HDFS Distributed Cache. This method can work only for relatively small inputs, since the Distributed Cache is limited, and with hundreds or thousands of blocks the shared file can exceed the limit.

I took the liberty of creating diagrams that show by example what each method does:

[1] One Buffer Method
[2] Shared File Method

Every insight and feedback about these two methods is more than welcome. In my opinion the One Buffer Method is simpler and more effective, since it can be used for both small and large datasets.

There is also a question: can the parser work on data that are missing some tags? For example, the first and last tags of the XML file, which are located in different blocks.

Best regards,
Efi

[1] https://docs.google.com/drawings/d/1QmsqZMn1ifz78UvJRX6jVD-QpUUr-x6659dV8BmO6o0/edit?usp=sharing
[2] https://docs.google.com/drawings/d/10tS_NV8tgH3y593R5arKIF_Ox8_cgQikzN72vMrletA/edit?usp=sharing
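The core loop of the One Buffer Method described in this message could be sketched roughly as follows (names and structure are my own illustration, not the project's code; the real version would live inside a Hadoop RecordReader, and Mahout's XmlInputFormat uses the same idea): buffer bytes from the current position and keep reading - across block boundaries - until the closing tag has been matched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class OneBufferSketch {
    // Copy bytes from 'in' into 'out' until the byte pattern 'match' has been
    // consumed (inclusive). Returns true if the pattern was found before EOF.
    static boolean readUntilMatch(InputStream in, byte[] match, StringBuilder out) throws IOException {
        int i = 0;
        int b;
        while ((b = in.read()) != -1) {
            out.append((char) b);
            if (b == match[i]) {
                i++;
                if (i == match.length) return true;   // closing tag fully seen
            } else {
                i = (b == match[0]) ? 1 : 0;          // restart a partial match
            }
        }
        return false;                                  // hit EOF first
    }

    public static void main(String[] args) throws IOException {
        // The '+' marks where an HDFS block boundary might fall: record B is
        // split across two blocks, but the stream hides that from the reader.
        String blocks = "<book>A</book><book>B</bo" + "ok><book>C</book>";
        InputStream in = new ByteArrayInputStream(blocks.getBytes(StandardCharsets.UTF_8));
        StringBuilder record = new StringBuilder();
        while (readUntilMatch(in, "</book>".getBytes(StandardCharsets.UTF_8), record)) {
            System.out.println("record: " + record);   // three complete <book> records
            record.setLength(0);
        }
    }
}
```

The buffer grows until the closing tag arrives, which is why the non-recursive-tag restriction discussed in this thread matters: a nested <book> would end the record at the inner closing tag.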
Re: [#131] Supporting Hadoop data and cluster management
Great work!

Steven
Re: [#131] Supporting Hadoop data and cluster management
+1 Sounds great!
[#131] Supporting Hadoop data and cluster management
Hello everyone,

This is my update on what I have been doing this last week: I created an XMLInputFormat Java class with the functionality that Hamza described in the issue [1]. The class reads from blocks located in HDFS and returns complete items according to a specified XML tag. I also tested this class on a standalone Hadoop cluster with XML files of various sizes, the smallest being a single file of 400 MB and the largest a collection of 5 files totalling 6.1 GB.

This week I will create another implementation of the XMLInputFormat with a different way of reading and delivering files, the way I described in the same issue, and I will test both solutions on a standalone and a small Hadoop cluster (5-6 nodes). You can see this week's results here [2]. I will keep updating this file with the other tests.

Best regards,
Efi

[1] https://issues.apache.org/jira/browse/VXQUERY-131
[2] https://docs.google.com/spreadsheets/d/1kyIPR7izNMbU8ctIe34rguElaoYiWQmJpAwDb0t9MCw/edit?usp=sharing
[jira] [Issue Comment Deleted] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hamza Zafar updated VXQUERY-131:
Comment: was deleted

(was: Hey Efi, congrats on getting selected! While I was researching this problem, I thought of writing XMLInputFormat and XMLRecordReader classes to handle incomplete XML documents. I came across a nice implementation for handling XML files in HDFS by Apache Mahout [1]. The general idea is: if the XML record is incomplete, read into the next block - which might reside on a different node - until you find the closing tag.

[1] https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java)

Supporting Hadoop data and cluster management
Key: VXQUERY-131
URL: https://issues.apache.org/jira/browse/VXQUERY-131
Project: VXQuery
Issue Type: Improvement
Reporter: Preston Carman
Assignee: Preston Carman
Labels: gsoc, gsoc2015, hadoop, java, mentor, xml

Many organizations support Hadoop. It would be nice to be able to read data from this source. The project will include creating a strategy (with the mentor's guidance) for reading XML data from HDFS and implementing it. When connecting VXQuery to HDFS, the strategy may need to consider how to read sections of an XML file. In addition, we could use Yarn as our cluster manager. Apache Hadoop YARN (Yet Another Resource Negotiator) would be a good cluster management tool for VXQuery. If VXQuery can read data from HDFS, then why not also manage the cluster with a tool provided by Hadoop. The solution would replace the current custom Python scripts for cluster management.

Goal
- Read XML from HDFS
- Manage the VXQuery cluster with Yarn

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517275#comment-14517275 ]

Steven Jacobs commented on VXQUERY-131:

One other thing to note is that every XML file has a root element that encloses the entire file. Hamza's link might address this issue as well.

Steven
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507108#comment-14507108 ]

Efi Kaltirimidou commented on VXQUERY-131:

Thank you, Steven, for your suggestions; I will look into them. The approach you recommended for my second question in particular seems quite interesting. For the first question, I will share here anything helpful that I may find.

Efi
[jira] [Issue Comment Deleted] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hamza Zafar updated VXQUERY-131:
Comment: was deleted

(was: Dear Preston,

My background: I am Hamza Zafar, a final-year undergraduate student of computer science at NUST, Pakistan. I have been a student researcher at the HPC research center at my department. At the HPC lab we are focused on developing and maintaining MPJ Express (http://mpj-express.org/), an open-source Java implementation of MPI.

Open-source contributions: For my final-year project, I worked on Apache Hadoop and the MPJ Express project. The project required writing a new runtime for MPJ Express to bootstrap its processes on a Hadoop YARN cluster. The new runtime uses the Hadoop YARN resource manager to dynamically allocate resources in terms of memory and CPU. As much of today's enterprise data resides on the Hadoop Distributed File System (HDFS), this project will let enterprises combine the performance of HPC with the usability and flexibility of the Big Data stack. The development of the MPJ Express YARN runtime is complete; currently I am working on releasing the software in the next few weeks. A research paper is under review at ICCS.

My thoughts about the VXQuery and YARN project: I do not have any past experience with the VXQuery project (I hope to learn it), but I am comfortable writing YARN applications. I anticipate that this project is geared towards replacing the Python scripts that launch VXQuery jobs with the YARN resource manager. YARN can help spawn containers in the cluster, and the containers can then run the queries on XML data files residing in HDFS. The Application Master can be very handy for rescheduling failed containers and maintaining the running ones.

Looking forward to working on this project :)

Yours sincerely,
Hamza Zafar
LinkedIn: pk.linkedin.com/pub/hamza-zafar/59/739/205/ )
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497346#comment-14497346 ]

Efi Kaltirimidou commented on VXQUERY-131:

Dear Preston, my name is Efi Kaltirimidou and I am one of the students who applied for this project for this year's GSoC. After reading the details you described for these goals and studying the project's code, I would like your opinion on which parts of these features you think are the most challenging to implement. It would be good to know, so that I can start reading about them and handle them without problems.

Another question I have is about the YARN scheduling options. YARN offers some standard scheduling policies for workload optimization, like FIFO, but also allows custom algorithms to be implemented. Does something like this exist in the current project and will it be implemented in YARN, or will something standard be used?

Thank you,
Efi
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370862#comment-14370862 ] Till Westmann commented on VXQUERY-131: --- To subscribe to the mailing list you can either - click on the subscribe link on http://vxquery.apache.org/mail-lists.html or - send an e-mail to dev-subscr...@vxquery.apache.org and follow the instructions in that e-mail.
[jira] [Issue Comment Deleted] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Westmann updated VXQUERY-131: -- Comment: was deleted (was: Referring to your comment: Can we do a simple query on HDFS? (Start by reading a local file and transfer any additional file blocks as necessary to read the whole XML file. Loses efficiency when processing multiple-block files.) This implementation could be pretty straightforward. Hadoop provides the FileSystem API to interact with data in HDFS. We can open an FSDataInputStream at a given path; if there are multiple blocks, they are read sequentially (in order). )
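The read pattern described in the deleted comment above can be sketched with an in-memory stream standing in for HDFS. This is a pure-Python simulation, not the Hadoop API: `BLOCK_SIZE` and both helper names are made up for illustration (the real calls would be `FileSystem.open` returning an `FSDataInputStream`, which supports `seek`).

```python
import io

BLOCK_SIZE = 16  # stand-in for the HDFS block size (Hadoop's old default was 64 MB)

def read_whole_file(stream):
    # Sequential read from offset 0: blocks arrive in order, which is what
    # a single input stream gives you when the whole file is consumed.
    stream.seek(0)
    return stream.read()

def read_block(stream, block_index, block_size=BLOCK_SIZE):
    # Seek-based read of one block: the pattern a per-split reader would use.
    stream.seek(block_index * block_size)
    return stream.read(block_size)

data = b"<books><book>a</book><book>b</book></books>"
hdfs_file = io.BytesIO(data)  # simulated HDFS file

assert read_whole_file(hdfs_file) == data
num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
blocks = [read_block(hdfs_file, i) for i in range(num_blocks)]
assert b"".join(blocks) == data  # per-block reads reassemble the file
```

The point of the sketch is that the two access styles see exactly the same bytes; the sequential style is simpler, while the seek-based style is what enables one reader per block.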
[jira] [Issue Comment Deleted] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Westmann updated VXQUERY-131: -- Comment: was deleted (was: To subscribe to the mailing list you can either - click on the subscribe link on http://vxquery.apache.org/mail-lists.html or - send an e-mail to dev-subscr...@vxquery.apache.org and follow the instructions in that e-mail.)
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369808#comment-14369808 ] AASHEESH RANJAN commented on VXQUERY-131: - Sir, I am presently working on a similar project: high-performance distributed computing for big data using the Hadoop framework, running applications on large clusters. It includes a distributed file system (HDFS), programming support for MapReduce, and infrastructure software for grid computing. I designed a framework for capturing workload statistics and replaying workload simulations to allow the assessment of framework improvements, and a benchmark suite for data-intensive supercomputing applications that would present a target that Hadoop (and other MapReduce implementations) should be optimized for. I also want to design and build a scalable Internet anomaly detector over a very high-throughput event stream, with low latency as well as high throughput as goals; it could be used for all sorts of things, such as intrusion detection. This open-source data management software helps organizations analyze massive volumes of structured and unstructured data. I deploy a Hadoop cluster consisting of a number of server nodes; these are used to store data and process it in a parallel, distributed way. To create the automation setup, I use Python. Sir, I have a problem regarding how to join your mailing list; please help.
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367290#comment-14367290 ] Hamza Zafar commented on VXQUERY-131: - Since HDFS stores files in blocks of a fixed size (the default is 64 MB), XML files will be divided into several blocks. The blocks will be distributed to datanodes on different machines. Processing the chunks of an XML file in parallel requires launching the VXQuery containers on the nodes where the blocks of the XML file reside, so the queries will work on blocks in local storage. How do you plan to aggregate the results? Will there be a VXQuery reducer process that can receive the results from the other VXQuery containers (which processed the local XML blocks)? If there is a VXQuery reducer, what would be the communication mechanism between the VXQuery containers and the VXQuery reducer?
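The block-boundary problem raised above (an XML record can straddle two blocks) can be made concrete with a small simulation. This is plain Python over a byte string, not HDFS, and it assumes, as discussed elsewhere in the thread, that the chosen tag is not nested recursively. The convention sketched here: each block's reader emits only the records whose *start* tag falls inside its own block, reading past the block boundary to finish a record that spills over, so every record is produced exactly once.

```python
BLOCK_SIZE = 32  # stand-in for the HDFS block size

def records_for_block(data, block_start, block_size=BLOCK_SIZE, tag=b"book"):
    """Emit every <book>...</book> record whose start tag lies inside this
    block, reading past the block boundary to finish the last record.
    Records that merely spill into this block from the previous one are
    skipped: the previous block's reader owns them."""
    open_tag, close_tag = b"<" + tag + b">", b"</" + tag + b">"
    out = []
    pos = data.find(open_tag, block_start)
    while pos != -1 and pos < block_start + block_size:
        end = data.find(close_tag, pos)
        if end == -1:
            break  # truncated / malformed record
        out.append(data[pos:end + len(close_tag)])
        pos = data.find(open_tag, end)
    return out

data = b"<books>" + b"".join(b"<book>%d</book>" % i for i in range(6)) + b"</books>"

# The union of all per-block readers yields all records, each exactly once.
all_records = []
for start in range(0, len(data), BLOCK_SIZE):
    all_records += records_for_block(data, start)
assert all_records == [b"<book>%d</book>" % i for i in range(6)]
```

With this convention a block containing only the middle of a record (no start tag for it) contributes nothing for that record; the reader of the block holding the start tag parses it instead, which matches the behavior described in the thread.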
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367609#comment-14367609 ] Preston Carman commented on VXQUERY-131: Check out this application template for GSoC: http://community.staging.apache.org/gsoc#application-template VXQuery currently reads data from local files. The system understands the data partitions across nodes and creates Hyracks jobs that read local data and communicate results across nodes only when needed for the given query. HDFS may affect the Hyracks job creation, or it may be independent, depending on our approach. I see a few options on the road to an efficient XQuery on HDFS. - Can we do a simple query on HDFS? (Start by reading a local file and transfer any additional file blocks as necessary to read the whole XML file. Loses efficiency when processing multiple-block files.) - Can we read a partial XML file on HDFS? (Read only XML on local nodes, but upgrade the parser to read partial XML documents. Loses some XQuery properties.) - Create a new HDFS file loader and reader to better handle the XML document properties for processing XQueries. In each of these cases, I assume that after the data is read in, the VXQuery job can handle the rest. The result of the project may be a basic approach followed by an optimized method once we understand the issues better.
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366264#comment-14366264 ] AASHEESH RANJAN commented on VXQUERY-131: - Sir, I am a 4th-year CSE student. I know big data and Hadoop, I am an RHCA, and I also know Python. I am also working on a Python cluster management script that can deploy, start, and stop a cluster. I want to discuss this project, but I can't find out how to do so; please advise.
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361885#comment-14361885 ] Till Westmann commented on VXQUERY-131: --- Good questions :) 1. VXQuery currently is a pure query processor and doesn't support updates, so there's nothing that we can write while processing. However, we could certainly write the result of a query back to HDFS. I think that we haven't called that out explicitly, but it would certainly be a great addition to the ability to read from HDFS. 2. There is a very nice solution to integrate JSON and XML processing called JSONiq (http://www.jsoniq.org). JSONiq extends the XQuery data model by adding arrays and objects, and extends XQuery itself by functions that work with the added instances of the data model. It would be great to extend VXQuery to also support JSONiq, but there's no plan to do that so far (but plans can change ...).
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354032#comment-14354032 ] Preston Carman commented on VXQUERY-131: http://www.w3.org/TR/xquery/#id-document-order The elements in the result could be out of order depending on the method used to read the XML data. I believe this will depend on the file input reader.
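The document-order concern above can be illustrated with a small simulation (pure Python, hypothetical helper names): if each parsed record remembers its byte offset in the file, results produced by readers that finish out of order can be sorted back into document order afterwards.

```python
import random

def parse_records(data, tag=b"book"):
    """Yield (offset, record) pairs; the byte offset is enough to restore
    document order even if blocks are parsed out of order."""
    open_tag, close_tag = b"<" + tag + b">", b"</" + tag + b">"
    pos = data.find(open_tag)
    while pos != -1:
        end = data.find(close_tag, pos) + len(close_tag)
        yield pos, data[pos:end]
        pos = data.find(open_tag, end)

data = b"<books>" + b"".join(b"<book>%d</book>" % i for i in range(5)) + b"</books>"

records = list(parse_records(data))
random.shuffle(records)                     # simulate readers finishing out of order
restored = [r for _, r in sorted(records)]  # sorting by offset restores document order
assert restored == [r for _, r in parse_records(data)]
```

Whether such a sort (or a merge of already-ordered per-block streams) is needed would depend on how the file input reader hands blocks to the query engine, as the comment notes.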
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354036#comment-14354036 ] Preston Carman commented on VXQUERY-131: Python Scripts - https://git-wip-us.apache.org/repos/asf?p=vxquery.git;a=tree;f=vxquery-server/src/main/resources/scripts;h=aa6f2b49a285702bbdd695f1751fd49945c64880;hb=b1109faba960ef07cb6bd55b5285db057eb4d831 CLI - https://git-wip-us.apache.org/repos/asf?p=vxquery.git;a=blob;f=vxquery-cli/src/main/java/org/apache/vxquery/cli/VXQuery.java;h=080f8a12db0189d5d3d705953a84eedc1b474f53;hb=b1109faba960ef07cb6bd55b5285db057eb4d831 XML Parser - https://git-wip-us.apache.org/repos/asf?p=vxquery.git;a=tree;f=vxquery-core/src/main/java/org/apache/vxquery/xmlparser;h=27b267a29886bbcefc3c82ce13769b4afe53b421;hb=b1109faba960ef07cb6bd55b5285db057eb4d831
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354054#comment-14354054 ] Preston Carman commented on VXQUERY-131: Looking forward to seeing your proposal.
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351485#comment-14351485 ] Vinshul Arora commented on VXQUERY-131: --- Thanks Preston, that cleared a lot of doubts about the requirements of this idea. If I understand your reply correctly, I think we need to do something like this to get the code working as required: connect Apache VXQuery directly to HDFS (coding a framework in which different sections of XML data are correctly taken as input), run the query, store the results of that query in the distributed cache, and after that run Hadoop's traditional MapReduce job. Modifications could be made in the writing part of the XML data (when data is written to HDFS after the query is executed), as that part of the code affects the process of parallelism. Am I heading in the right direction?
[jira] [Commented] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351489#comment-14351489 ] sagarsharma commented on VXQUERY-131: - Thanks Preston, that will help quite a lot, but I have a question: the Hadoop ecosystem is itself made to operate on unstructured data and store it in HDFS as structured data, so why do we need to worry about whether reading XML maintains XQuery's document order?
[jira] [Updated] (VXQUERY-131) Supporting Hadoop data and cluster management
[ https://issues.apache.org/jira/browse/VXQUERY-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Westmann updated VXQUERY-131: -- Issue Type: Improvement (was: Test) Supporting Hadoop data and cluster management - Key: VXQUERY-131 URL: https://issues.apache.org/jira/browse/VXQUERY-131 Project: VXQuery Issue Type: Improvement Reporter: Preston Carman Labels: GSOC Many organizations support Hadoop. It would be nice to be able to read data from this source. In addition, we could use Yarn as our cluster manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)