Re: [R] The Future of R | API to Public Databases
Just a note on this earlier thread: R package collections for public APIs now seem to be emerging and thriving in various disciplines:

- Bioconductor: the collection contains many API packages for bioinformatics
- rOpenSci: R tools for open science-related APIs
- rOpenHealth: R tools for open healthcare-related APIs
- rOpenGov: R tools for open government data and computational social science (disclaimer: I am one of the main developers for this one)

These community projects aim to fill the gap discussed in this thread. The APIs and needs are many, and best tackled with community-driven package collections written by the actual users. Have a look at those projects - your contributions to any of them are certainly welcome.

best, Leo Lahti

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
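[Editorial aside: a minimal sketch, not from the thread, of the kind of thin wrapper these package collections provide. The World Bank exposes a documented REST API; the URL layout and indicator code below follow its v2 conventions, but treat the details as assumptions to check against the official documentation.]

```r
# Sketch: build a World Bank Indicators API (v2) query URL.
# The URL pattern is an assumption based on the public API docs.
worldbank_url <- function(country, indicator, format = "json") {
  paste0("http://api.worldbank.org/v2/country/", country,
         "/indicator/", indicator, "?format=", format)
}

url <- worldbank_url("DE", "SP.POP.TOTL")
# A full wrapper would now fetch and parse, e.g.:
# raw <- jsonlite::fromJSON(url)  # needs the 'jsonlite' package and network access
print(url)
```

A community package wraps exactly this kind of detail (endpoint layout, parameter names, response parsing) so the user only ever types a country and an indicator.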
Re: [R] The Future of R | API to Public Databases
Yes, R-devel would be the right mailing list for this discussion. As some people pointed out, the problem definition is vague. This was to encourage people to share their *different* perceptions of the problem and to reach, to some extent, a consensus. My starting point came from my own mind; consequently I must be an egocentric person. I agree on that. There are a lot of other egocentric persons who download R and just want to have their results ASAP. That's reality. The same is true of each and every special interest group (where each and every member has a special interest). Everyone cares only about his own needs. That is the systematic issue we have to overcome by working together to simplify everyone's individual situation. Finally we should reach a win-win situation for all. That is my notion. What I wanted to point out was more or less the process of statistical research:

1. Set up your research objective
2. Find the right data (time intensive)
3. Download the right format
4. Import it, make it compatible, clean it up
5. Work with it
6. Get your results

The more integrative your research objective, the more time you spend on steps 1 to 3. And steps 1 to 3 make up most of the time in most cases. Some people will give up due to lack of time or simply due to lack of accessibility of data. I highly appreciate that a lot of people participated in this discussion, that the publishers themselves address the problem nowadays (just take a look at [1]) and that some people are working on it in the R world (e.g. TSdbi). Reality is better than I initially perceived it. But it is not as it should be. Benjamin

[1] http://sdmx.org/wp-content/uploads/2011/10/SDMX-Action-Plan-2011_2015.pdf

On 15 January 2012 13:15, Prof Brian Ripley rip...@stats.ox.ac.uk wrote: On 14/01/2012 18:51, Joshua Wiley wrote: I have been following this thread, but there are many aspects of it which are unclear to me. Who are the publishers? Who are the users? What is the problem?
I have a vague sense for some of these, but it seems to me like one valuable starting place would be creating a document that clarifies everything. It is easier to tackle a concrete problem (e.g., agree on a standard numerical representation of dates and times a la ISO 8601) than something diffuse (e.g., information overload). Let alone something as vague as 'the future of R' (for which the R-devel list is the appropriate one). I believe the original poster is being egocentric: as someone said earlier, she has never had need of this concept, and I believe that is true of the vast majority of R users. The development of R per se is primarily driven by the needs of the core developers and those around them. Other R communities have set up their own special-interest groups and sets of packages, and that would seem the way forward here. Good luck, Josh

On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber m...@bwe.im wrote: Mike - We see that the publishers are aware of the problem. They don't think that the raw data is usable for the user. Consequently they acknowledge this fact with their proprietary formats. Yes, they have resigned in the face of the information overload. That's pathetic. It is not a question of *which* data format, it is a question about the general concept. Where do publisher and user meet? There has to be one *defined* point which all parties agree on. I disagree with your statement that the publisher should just publish csv or cook his own API. That leads to fragmentation and inaccessibility of data. We want data to be accessible. A more pragmatic approach is needed to revolutionize the way we go about raw data. Benjamin

On 14 January 2012 22:17, Mike Marchywka marchy...@hotmail.com wrote: LOL, I remember posting about this in the past. The US gov agencies vary but most are quite good. The big problem appears to be people who push proprietary or commercial standards for which only one effective source exists.
Some formats, like Excel and PDF, come to mind and there is a disturbing trend towards their adoption in some places where raw data is needed by many. The best thing to do is contact the information provider and let them know you want raw data, not images or stuff that works in limited commercial software packages. Often data sources are valuable and the revenue model impacts availability. If you are just arguing over different open formats, it is usually easy for someone to write some conversion code and publish it - CSV to JSON would not be a problem for example. Data of course are quite variable and there is nothing wrong with giving the provider his choice.

Date: Sat, 14 Jan 2012 10:21:23 -0500 From: ja...@rampaginggeek.com To: r-help@r-project.org Subject: Re: [R] The Future of R | API to Public Databases

Web services are only part of the problem. In essence, there are at least two facets: 1. downloading the data using some
Re: [R] The Future of R | API to Public Databases
Spencer - I highly appreciate your input. What we need is a standard for statistics. That may reinvent the way we see data. The recent crisis is the best proof that we are lost in our own generated information overload. The traditional approach is not working anymore. Finding the right members for the initial committee would be the hardest but most important part. Another point is that I am only a 21-year-old student, which limits the financial capabilities I can commit to such work. But I have my motivation, which is the *real* engine to advance an idea. I am open to working on it in my spare time. Over time I would become an expert in my own field; that is implicit in such a decision. I don't have the background of a statistician, but I know what the relevance of data is. It may be a solution that a newcomer gives a new perspective. Starting from scratch is at some point beneficial. It will be even harder for a person like me to convince the experienced professionals to overcome their own conventional schemes and procedures, because my approach would pay no respect to the established ones. Why the hell should I know better than the experts? I respect single solutions; they might work in a specific situation, but they make it impossible to put everything together into the big picture which is finally required. I am really interested in leading the initiative for such a new standard. My problem is how to start. Would a scientific paper which proposes the development of a standard be a starting point? Benjamin

On 14 January 2012 08:19, Spencer Graves spencer.gra...@structuremonitoring.com wrote: A traditional way to exit a chaotic situation as you describe is to try to establish a standards committee, invite participation from suppliers and users of whatever (data in this case), apply for registration with the International Organization for Standardization, and organize meetings, draft and circulate a proposed standard, etc.
A statistician who had published maybe 100 papers and 3 books told me that his work on ISO 9000 (I think) made a larger contribution to humanity than anything else he had done. Work on standards is one of the most boring, tedious activities I can imagine -- and can potentially be the most impactful thing one does in this life: If you have an ISO standard number for something, people who are starting something new may find it and follow it. People who are working to upgrade something may tell their management, 'Let's follow this standard.' Customers sometimes ask their suppliers to follow the standard; if you follow the standard, you might get more customers. I think you could get support for such a standardization effort from the American Association for the Advancement of Science, the American Economics Association, the American Statistical Association, and many other organizations, including many online science journals that today pressure authors to put the data behind their published papers in the public domain, downloadable from their web sites, etc. IMHO. Spencer

On 1/13/2012 3:39 PM, Benjamin Weber wrote: The whole issue is related to the mismatch of (1) the publisher of the data and (2) the user at the rendezvous point. Both the publisher and the user don't know anything about the rendezvous point. Both want to meet but don't meet in reality. The user wastes time to find the rendezvous point defined by the publisher. The publisher assumes any rendezvous point. Given the number of publishers, the variety of the fields and the flavor of each expert, we end up in today's data world. Everyone has to waste his precious time to find the rendezvous point. Only experts know which corner to focus their search on - but even they need time to find what they want. However, each expert (of each profession) believes that his approach is the best one in the world.
Finally we have a state of total confusion, where only experts can handle the information and non-experts cannot even access the data without diving fully into the flood of data and its specialities. That's my point: data is not accessible. The discussion should follow a strategic approach:

- Is the classical csv file (in all its varieties) the simplest and best way?
- Isn't it the responsibility of the R community to recommend standards for different kinds of data?

With the existence of this rendezvous point the publisher would know a specific point which is favorable from the user's point of view. That is what is missing. Only a rendezvous point defined by the community can be a 'known' rendezvous point for all stakeholders, globally. I do believe that the publisher's greatest interest is data accessibility. Where is the toolkit we provide them to enable them to serve us the data exactly as we want it? No, we just keep building more packages to be lost in the noise of information. I disagree with a proposed solution to
Re: [R] The Future of R | API to Public Databases
Web services are only part of the problem. In essence, there are at least two facets: 1. downloading the data using some protocol, 2. mapping the data to a common model. Having #1 makes the import/download easier, but it really becomes useful when both are included. I think #2 is the harder problem to address. Software can usually be written to handle #1 by making a useful abstraction layer. #2 means that data has consistent names and meanings, and this requires people to agree on common definitions and a common naming convention. RDF (Resource Description Framework) and its related technologies (SPARQL, OWL, etc.) are one of the many attempts to address this. While this effort would benefit R, I think it's best if it's part of a larger effort. Services such as DBpedia and Freebase are trying to unify many data sets using RDF. The task view and package ideas are great ideas. I'm just adding another perspective. Jason

On 01/13/2012 05:18 PM, Roy Mendelssohn wrote: Hi Benjamin: What would make this easier is if these sites used standardized web services, so it would only require writing once. data.gov is the worst example; they spun their own, weak service. There is a lot of environmental data available through OPeNDAP, and that is supported in the ncdf4 package. My own group has a service called ERDDAP that is entirely RESTful, see: http://coastwatch.pfel.noaa.gov/erddap and http://upwell.pfeg.noaa.gov/erddap We provide R (and Matlab) scripts that automate the extract for certain cases, see: http://coastwatch.pfeg.noaa.gov/xtracto/ We also have a tool called the Environmental Data Connector (EDC) that provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to subset data that is served by OPeNDAP, ERDDAP, and certain Sensor Observation Service (SOS) servers, and have it read directly into R.
It is freely available at: http://www.pfeg.noaa.gov/products/EDC/ We can write such tools because the service is either standardized (OPeNDAP, SOS) or is easy to implement (ERDDAP). -Roy

On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote: Dear R Users - R is a wonderful software package. CRAN provides a variety of tools to work on your data. But R is not apt to utilize all the public databases in an efficient manner. I observed that the most tedious part of working with R is searching for and downloading the data from public databases and putting it into the right format. I could not find a package on CRAN which offers exactly this fundamental capability. Imagine R as the unified interface to access (and analyze) all public data in the easiest way possible. That would create a real impact, would move R a big leap forward and would enable us to see the world with different eyes. There is a lack of a direct connection to the APIs of these databases, to name a few: - Eurostat - OECD - IMF - Worldbank - UN - FAO - data.gov - ... The ease of access to data is the key to information processing with R. How can we handle the flow of information noise? R has to give an answer to that with an extensive API to public databases. I would love your comments and ideas as a contribution to a vital discussion. Benjamin

** The contents of this message do not reflect any position of the U.S. Government or NOAA. ** Roy Mendelssohn Supervisory Operations Research Analyst NOAA/NMFS Environmental Research Division Southwest Fisheries Science Center 1352 Lighthouse Avenue Pacific Grove, CA 93950-2097 e-mail: roy.mendelss...@noaa.gov (Note new e-mail address) voice: (831)-648-9029 fax: (831)-648-8440 www: http://www.pfeg.noaa.gov/ Old age and treachery will overcome youth and skill.
"From those who have been given much, much will be expected." "The arc of the moral universe is long, but it bends toward justice." -MLK Jr.
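[Editorial aside: Jason's facet #2 - consistent names and meanings - can be sketched in a few lines of R. The provider field names below are invented for illustration; the hard part in practice is the social one of agreeing on the shared vocabulary.]

```r
# Hypothetical mapping from provider-specific column names to a common model.
# Keys are provider names; values are the agreed common names.
common_names <- c(CTRY = "country", ref_area = "country",
                  OBS_VALUE = "value", obs = "value",
                  TIME_PERIOD = "year", date = "year")

# Rename any recognized columns to their common-model equivalents.
to_common_model <- function(df) {
  hits <- names(df) %in% names(common_names)
  names(df)[hits] <- common_names[names(df)[hits]]
  df
}

provider_a <- data.frame(CTRY = "DE", TIME_PERIOD = 2011, OBS_VALUE = 3.1)
provider_b <- data.frame(ref_area = "FR", date = 2011, obs = 2.7)

# After mapping, both tables share column names and combine directly.
combined <- rbind(to_common_model(provider_a), to_common_model(provider_b))
print(combined)
```

This only works because both sides committed to one vocabulary in advance - which is exactly the "rendezvous point" the thread keeps circling around.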
Re: [R] The Future of R | API to Public Databases
LOL, I remember posting about this in the past. The US gov agencies vary but most are quite good. The big problem appears to be people who push proprietary or commercial standards for which only one effective source exists. Some formats, like Excel and PDF, come to mind and there is a disturbing trend towards their adoption in some places where raw data is needed by many. The best thing to do is contact the information provider and let them know you want raw data, not images or stuff that works in limited commercial software packages. Often data sources are valuable and the revenue model impacts availability. If you are just arguing over different open formats, it is usually easy for someone to write some conversion code and publish it - CSV to JSON would not be a problem for example. Data of course are quite variable and there is nothing wrong with giving the provider his choice.
Re: [R] The Future of R | API to Public Databases
Mike - We see that the publishers are aware of the problem. They don't think that the raw data is usable for the user. Consequently they acknowledge this fact with their proprietary formats. Yes, they have resigned in the face of the information overload. That's pathetic. It is not a question of *which* data format, it is a question about the general concept. Where do publisher and user meet? There has to be one *defined* point which all parties agree on. I disagree with your statement that the publisher should just publish csv or cook his own API. That leads to fragmentation and inaccessibility of data. We want data to be accessible. A more pragmatic approach is needed to revolutionize the way we go about raw data. Benjamin
Re: [R] The Future of R | API to Public Databases
I have been following this thread, but there are many aspects of it which are unclear to me. Who are the publishers? Who are the users? What is the problem? I have a vauge sense for some of these, but it seems to me like one valuable starting place would be creating a document that clarifies everything. It is easier to tackle a concrete problem (e.g., agree on a standard numerical representation of dates and times a la ISO 8601) than something diffuse (e.g., information overload). Good luck, Josh On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber m...@bwe.im wrote: Mike We see that the publishers are aware of the problem. They don't think that the raw data is the usable for the user. Consequently they recognizing this fact with the proprietary formats. Yes, they resign in the information overload. That's pathetic. It is not a question of *which* data format, it is a question about the general concept. Where do publisher and user meet? There has to be one *defined* point which all parties agree on. I disagree with your statement that the publisher should just publish csv or cook his own API. That leads to fragmentation and inaccessibility of data. We want data to be accessible. A more pragmatic approach is needed to revolutionize the way we go about raw data. Benjamin On 14 January 2012 22:17, Mike Marchywka marchy...@hotmail.com wrote: LOL, I remember posting about this in the past. The US gov agencies vary but mostare quite good. The big problem appears to be people who push proprietary orcommercial standards for which only one effective source exists. Some formats,like Excel and PDF come to mind and there is a disturbing trend towards theiradoption in some places where raw data is needed by many. The best thing to do is contact the informationprovider and let them know you want raw data, not images or stuff that worksin limited commercial software packages. Often data sources are valuable andthe revenue model impacts availability. 
If you are just arguing over different open formats, it is usually easy for someone towrite some conversion code and publish it- CSV to JSON would not be a problem for example. Data of course are quite variable and there is nothingwrong with giving provider his choice. Date: Sat, 14 Jan 2012 10:21:23 -0500 From: ja...@rampaginggeek.com To: r-help@r-project.org Subject: Re: [R] The Future of R | API to Public Databases Web services are only part of the problem. In essence, there are at least two facets: 1. downloading the data using some protocol 2. mapping the data to a common model Having #1 makes the import/download easier, but it really becomes useful when both are included. I think #2 is the harder problem to address. Software can usually be written to handle #1 by making a useful abstraction layer. #2 means that data has consistent names and meanings, and this requires people to agree on common definitions and a common naming convention. RDF (Resource Description Framework) and its related technologies (SPARQL, OWL, etc) are one of the many attempts to try to address this. While this effort would benefit R, I think it's best if it's part of a larger effort. Services such as DBpedia and Freebase are trying to unify many data sets using RDF. The task view and package ideas a great ideas. I'm just adding another perspective. Jason On 01/13/2012 05:18 PM, Roy Mendelssohn wrote: HI Benjamin: What would make this easier is if these sites used standardized web services, so it would only require writing once. data.gov is the worst example, they spun the own, weak service. There is a lot of environmental data available through OPenDAP, and that is supported in the ncdf4 package. 
My own group has a service called ERDDAP that is entirely RESTful; see: http://coastwatch.pfel.noaa.gov/erddap and http://upwell.pfeg.noaa.gov/erddap We provide R (and Matlab) scripts that automate the extract for certain cases; see: http://coastwatch.pfeg.noaa.gov/xtracto/ We also have a tool called the Environmental Data Connector (EDC) that provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to subset data that is served by OPeNDAP, ERDDAP, certain Sensor Observation Service (SOS) servers, and have it read directly into R. It is freely available at: http://www.pfeg.noaa.gov/products/EDC/ We can write such tools because the service is either standardized (OPeNDAP, SOS) or is easy to implement (ERDDAP). -Roy On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote: Dear R Users - R is a wonderful software package. CRAN provides a variety of tools to work on your data. But R is not apt to utilize all the public databases in an efficient manner. I observed that the most tedious part with R is searching and downloading the data from public databases and putting
Re: [R] The Future of R | API to Public Databases
The situation for this kind of interface is much more advanced (for economic time series data) than has been suggested in other postings. Several of the organizations you mention support SDMX and I believe there is a working R interface to SDMX which has not yet been made public. A more complete list of organizations that I think already have working server-side support for SDMX is: the OECD, Eurostat, the ECB, the IMF, the UN, the BIS, the Federal Reserve Board, the World Bank, the Italian Statistics agency, and to a small extent the Bank of Canada. I have a working API to several time series databases (TS* packages on CRAN), and a partially working interface to SDMX, but have postponed further development of that in the hope that the already working code will be made available. Please see http://tsdbi.r-forge.r-project.org/ for more details. I would, of course, be happy to have other developers involved in this project. If you think you can contribute then see r-forge.r-project.org for details on how to join projects. Paul On 12-01-14 06:00 AM, r-help-requ...@r-project.org wrote: Date: Sat, 14 Jan 2012 02:44:07 +0530 From: Benjamin Weber m...@bwe.im To: r-help@r-project.org Subject: [R] The Future of R | API to Public Databases Dear R Users - R is a wonderful software package. CRAN provides a variety of tools to work on your data. But R is not apt to utilize all the public databases in an efficient manner. I observed that the most tedious part with R is searching and downloading the data from public databases and putting it into the right format. I could not find a package on CRAN which offers exactly this fundamental capability. Imagine R is the unified interface to access (and analyze) all public data in the easiest way possible.
That would create a real impact, would move R a big leap forward and would enable us to see the world with different eyes. There is a lack of a direct connection to the API of these databases, to name a few: - Eurostat - OECD - IMF - Worldbank - UN - FAO - data.gov - ... The ease of access to the data is the key to information processing with R. How can we handle the flow of information noise? R has to give an answer to that with an extensive API to public databases. I would love your comments and ideas as a contribution in a vital discussion. Benjamin __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
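[Editorial note: at bottom, the pattern that Paul's TS* interfaces (and most clients discussed in this thread) wrap is small enough to sketch in base R. The quarterly values below are invented, and `text =` stands in for a real url() connection:]

```r
## A fake API response; a real client would pass url("http://...") instead
csv_response <- "date,value
2011-07-01,100.1
2011-10-01,100.8
2012-01-01,101.9"

gdp <- read.csv(text = csv_response, stringsAsFactors = FALSE)
gdp$date <- as.Date(gdp$date)        # ISO 8601 dates parse directly

## Promote to a quarterly ts object starting 2011 Q3
gdp_ts <- ts(gdp$value, start = c(2011, 3), frequency = 4)
```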
Re: [R] The Future of R | API to Public Databases
...@rampaginggeek.com To: r-help@r-project.org Subject: Re: [R] The Future of R | API to Public Databases Web services are only part of the problem. In essence, there are at least two facets: 1. downloading the data using some protocol 2. mapping the data to a common model Having #1 makes the import/download easier, but it really becomes useful when both are included. I think #2 is the harder problem to address. Software can usually be written to handle #1 by making a useful abstraction layer. #2 means that data has consistent names and meanings, and this requires people to agree on common definitions and a common naming convention. RDF (Resource Description Framework) and its related technologies (SPARQL, OWL, etc) are one of the many attempts to address this. While this effort would benefit R, I think it's best if it's part of a larger effort. Services such as DBpedia and Freebase are trying to unify many data sets using RDF. The task view and package ideas are great ideas. I'm just adding another perspective. Jason On 01/13/2012 05:18 PM, Roy Mendelssohn wrote: Hi Benjamin: What would make this easier is if these sites used standardized web services, so it would only require writing the client once. data.gov is the worst example; they spun their own, weak service. There is a lot of environmental data available through OPeNDAP, and that is supported in the ncdf4 package. My own group has a service called ERDDAP that is entirely RESTful; see: http://coastwatch.pfel.noaa.gov/erddap and http://upwell.pfeg.noaa.gov/erddap We provide R (and Matlab) scripts that automate the extract for certain cases; see: http://coastwatch.pfeg.noaa.gov/xtracto/ We also have a tool called the Environmental Data Connector (EDC) that provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to subset data that is served by OPeNDAP, ERDDAP, certain Sensor Observation Service (SOS) servers, and have it read directly into R.
It is freely available at: http://www.pfeg.noaa.gov/products/EDC/ We can write such tools because the service is either standardized (OPeNDAP, SOS) or is easy to implement (ERDDAP). -Roy On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote: [...] ** The contents of this message do not reflect any position of the U.S. Government or NOAA. ** Roy Mendelssohn Supervisory Operations Research Analyst NOAA/NMFS Environmental Research Division Southwest Fisheries Science Center 1352 Lighthouse Avenue Pacific Grove, CA 93950-2097 e-mail: roy.mendelss...@noaa.gov (Note new e-mail address) voice: (831)-648-9029 fax: (831)-648-8440 www: http://www.pfeg.noaa.gov/ Old age and treachery will overcome youth and skill.
From those who have been given much, much will be expected. The arc of the moral universe is long, but it bends toward justice. -MLK Jr. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] The Future of R | API to Public Databases
On 14/01/2012 18:51, Joshua Wiley wrote: I have been following this thread, but there are many aspects of it which are unclear to me. Who are the publishers? Who are the users? What is the problem? I have a vague sense for some of these, but it seems to me like one valuable starting place would be creating a document that clarifies everything. It is easier to tackle a concrete problem (e.g., agree on a standard numerical representation of dates and times a la ISO 8601) than something diffuse (e.g., information overload). Let alone something as vague as 'the future of R' (for which the R-devel list is the appropriate one). I believe the original poster is being egocentric: as someone said earlier, she has never had need of this concept, and I believe that is true of the vast majority of R users. The development of R per se is primarily driven by the needs of the core developers and those around them. Other R communities have set up their own special-interest groups and sets of packages, and that would seem the way forward here. Good luck, Josh On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber m...@bwe.im wrote: Mike We see that the publishers are aware of the problem. They don't think that the raw data is usable for the user; consequently they acknowledge this fact with their proprietary formats. Yes, they resign in the face of the information overload. That's pathetic. It is not a question of *which* data format, it is a question about the general concept. Where do publisher and user meet? There has to be one *defined* point which all parties agree on. I disagree with your statement that the publisher should just publish CSV or cook his own API. That leads to fragmentation and inaccessibility of data. We want data to be accessible. A more pragmatic approach is needed to revolutionize the way we go about raw data. Benjamin On 14 January 2012 22:17, Mike Marchywka marchy...@hotmail.com wrote: LOL, I remember posting about this in the past.
The US gov agencies vary but most are quite good. The big problem appears to be people who push proprietary or commercial standards for which only one effective source exists. Some formats, like Excel and PDF, come to mind, and there is a disturbing trend towards their adoption in some places where raw data is needed by many. The best thing to do is contact the information provider and let them know you want raw data, not images or stuff that works in limited commercial software packages. Often data sources are valuable and the revenue model impacts availability. If you are just arguing over different open formats, it is usually easy for someone to write some conversion code and publish it; CSV to JSON would not be a problem, for example. Data of course are quite variable and there is nothing wrong with giving the provider his choice. Date: Sat, 14 Jan 2012 10:21:23 -0500 From: ja...@rampaginggeek.com To: r-help@r-project.org Subject: Re: [R] The Future of R | API to Public Databases Web services are only part of the problem. In essence, there are at least two facets: 1. downloading the data using some protocol 2. mapping the data to a common model Having #1 makes the import/download easier, but it really becomes useful when both are included. I think #2 is the harder problem to address. Software can usually be written to handle #1 by making a useful abstraction layer. #2 means that data has consistent names and meanings, and this requires people to agree on common definitions and a common naming convention. RDF (Resource Description Framework) and its related technologies (SPARQL, OWL, etc) are one of the many attempts to address this. While this effort would benefit R, I think it's best if it's part of a larger effort. Services such as DBpedia and Freebase are trying to unify many data sets using RDF. The task view and package ideas are great ideas. I'm just adding another perspective.
Jason On 01/13/2012 05:18 PM, Roy Mendelssohn wrote: Hi Benjamin: What would make this easier is if these sites used standardized web services, so it would only require writing the client once. data.gov is the worst example; they spun their own, weak service. There is a lot of environmental data available through OPeNDAP, and that is supported in the ncdf4 package. My own group has a service called ERDDAP that is entirely RESTful; see: http://coastwatch.pfel.noaa.gov/erddap and http://upwell.pfeg.noaa.gov/erddap We provide R (and Matlab) scripts that automate the extract for certain cases; see: http://coastwatch.pfeg.noaa.gov/xtracto/ We also have a tool called the Environmental Data Connector (EDC) that provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to subset data that is served by OPeNDAP, ERDDAP, certain Sensor Observation Service (SOS) servers, and have it read directly into R. It is freely available at: http://www.pfeg.noaa.gov/products/EDC/ We can write such tools because
[R] The Future of R | API to Public Databases
Dear R Users - R is a wonderful software package. CRAN provides a variety of tools to work on your data. But R is not apt to utilize all the public databases in an efficient manner. I observed that the most tedious part with R is searching and downloading the data from public databases and putting it into the right format. I could not find a package on CRAN which offers exactly this fundamental capability. Imagine R is the unified interface to access (and analyze) all public data in the easiest way possible. That would create a real impact, would move R a big leap forward and would enable us to see the world with different eyes. There is a lack of a direct connection to the API of these databases, to name a few: - Eurostat - OECD - IMF - Worldbank - UN - FAO - data.gov - ... The ease of access to the data is the key to information processing with R. How can we handle the flow of information noise? R has to give an answer to that with an extensive API to public databases. I would love your comments and ideas as a contribution in a vital discussion. Benjamin __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
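[Editorial note: one concrete reading of the "unified interface" proposal above is a single R generic with one method per data source. A hypothetical sketch; the class and function names are invented, and the method bodies are stubs rather than real API calls:]

```r
## One generic entry point for all sources...
fetch_data <- function(source, ...) UseMethod("fetch_data")

## ...with one method per provider; a real method would call that API
fetch_data.worldbank <- function(source, indicator, ...) {
  sprintf("would query the World Bank API for %s", indicator)
}
fetch_data.eurostat <- function(source, indicator, ...) {
  sprintf("would query Eurostat for %s", indicator)
}

wb <- structure(list(), class = "worldbank")
fetch_data(wb, indicator = "NY.GDP.PCAP.CD")
```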
Re: [R] The Future of R | API to Public Databases
R is Open Source. You're welcome to write tools, and submit your package to CRAN. I think some part of this has been done, based on questions to the list asking about those parts. Personally, I've been using S-Plus and then R for 18 years, and never required data from any of them. Which doesn't make it not important, but suggests that public databases aren't the be-all and end-all for R use. Sarah On Fri, Jan 13, 2012 at 4:14 PM, Benjamin Weber m...@bwe.im wrote: [...] -- Sarah Goslee http://www.functionaldiversity.org __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] The Future of R | API to Public Databases
The WDI package on CRAN already provides access to the World Bank data through their API; we also have an in-house package for FAOSTAT here at FAO, but it is not mature enough to be released on CRAN yet. Not sure about other international organisations, but I do agree that it would be nice if there were a package which would make these data more readily available to R users. On 13/01/12 22:58, Sarah Goslee wrote: [...] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
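[Editorial note: for readers following along, the WDI package mentioned above is already usable. A sketch, not run here since it requires install.packages("WDI") and a network connection; NY.GDP.PCAP.CD is the World Bank's GDP-per-capita indicator code:]

```r
library(WDI)   # not run here: needs a network connection

## Fetch GDP per capita (current US$) for two countries, 2005-2010
gdp <- WDI(country = c("DE", "FR"),
           indicator = "NY.GDP.PCAP.CD",
           start = 2005, end = 2010)
head(gdp)
```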
Re: [R] The Future of R | API to Public Databases
Hi Benjamin: What would make this easier is if these sites used standardized web services, so it would only require writing the client once. data.gov is the worst example; they spun their own, weak service. There is a lot of environmental data available through OPeNDAP, and that is supported in the ncdf4 package. My own group has a service called ERDDAP that is entirely RESTful; see: http://coastwatch.pfel.noaa.gov/erddap and http://upwell.pfeg.noaa.gov/erddap We provide R (and Matlab) scripts that automate the extract for certain cases; see: http://coastwatch.pfeg.noaa.gov/xtracto/ We also have a tool called the Environmental Data Connector (EDC) that provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you to subset data that is served by OPeNDAP, ERDDAP, certain Sensor Observation Service (SOS) servers, and have it read directly into R. It is freely available at: http://www.pfeg.noaa.gov/products/EDC/ We can write such tools because the service is either standardized (OPeNDAP, SOS) or is easy to implement (ERDDAP). -Roy On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote: [...] ** The contents of this message do not reflect any position of the U.S. Government or NOAA. ** Roy Mendelssohn Supervisory Operations Research Analyst NOAA/NMFS Environmental Research Division Southwest Fisheries Science Center 1352 Lighthouse Avenue Pacific Grove, CA 93950-2097 e-mail: roy.mendelss...@noaa.gov (Note new e-mail address) voice: (831)-648-9029 fax: (831)-648-8440 www: http://www.pfeg.noaa.gov/ Old age and treachery will overcome youth and skill. From those who have been given much, much will be expected. The arc of the moral universe is long, but it bends toward justice. -MLK Jr. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
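[Editorial note: Roy's point that standardized services make the client side almost trivial can be seen with ncdf4, whose nc_open() accepts an OPeNDAP URL just like a local file path. A sketch; the URL and the variable name "sst" are placeholders, not a real dataset:]

```r
library(ncdf4)   # not run here: needs network access and a real endpoint

## Open a remote OPeNDAP dataset exactly as if it were a local NetCDF file
nc  <- nc_open("http://example.org/opendap/sst_monthly.nc")
sst <- ncvar_get(nc, "sst")   # placeholder variable name
nc_close(nc)
```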
Re: [R] The Future of R | API to Public Databases
Sarah, I agree; I think it would be the exception rather than the rule that one would access these public data sources, given the range of needs of R users, who are generally analyzing their own data. Plus, IMO, it just is not very difficult to reformat the data into a suitable format, if need be, to import into R. Tom On Fri, Jan 13, 2012 at 4:58 PM, Sarah Goslee sarah.gos...@gmail.com wrote: [...] -- Thomas E Adams National Weather Service Ohio River Forecast Center 1901 South State Route 134 Wilmington, OH 45177 EMAIL: thomas.ad...@noaa.gov VOICE: 937-383-0528 FAX: 937-383-0033 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] The Future of R | API to Public Databases
It's a nice idea, but I wouldn't be optimistic about it happening: Each of these public databases no doubt has its own more or less unique API, and the people likely to know the API well enough to write R code to access any particular database will be specialists in that field. They likely won't know much if anything about other public databases. The likelihood of a group forming to develop ** and maintain ** a single R package to access the no-doubt huge variety of public databases strikes me as small. However, this looks like a great opportunity for a new CRAN Task View. The task view would simply identify which packages connect to which public databases. (sorry, I can't volunteer) -Don p.s. I can mention openair as a package that has tools to access public databases. -- Don MacQueen Lawrence Livermore National Laboratory 7000 East Ave., L-627 Livermore, CA 94550 925-423-1062 On 1/13/12 2:12 PM, MK mkao006rm...@gmail.com wrote: [...] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] The Future of R | API to Public Databases
On 1/13/2012 2:26 PM, MacQueen, Don wrote: It's a nice idea, but I wouldn't be optimistic about it happening: Each of these public databases no doubt has its own more or less unique API, and the people likely to know the API well enough to write R code to access any particular database will be specialists in that field. They likely won't know much if anything about other public databases. The likelihood of a group forming to develop ** and maintain ** a single R package to access the no-doubt huge variety of public databases strikes me as small. I agree. The more reasonable model is a collection of packages, each of which can access a particular data source. However, this looks like a great opportunity for a new CRAN Task View. The task view would simply identify which packages connect to which public databases. (sorry, I can't volunteer) A CRAN Task View would be well suited for this. I have tagged these sorts of packages on crantastic with the onlineData tag when I happen to notice one, but I have not made a concerted effort to find all packages. A Task View would be even better. http://crantastic.org/tags/onlineData -Don p.s. I can mention openair as a package that has tools to access public databases. Tagged it. -- Brian S. Diggs, PhD Senior Research Associate, Department of Surgery Oregon Health & Science University __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
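[Editorial note: if such a Task View is created, the ctv package makes it directly actionable for users. A sketch, not run here since it needs a network connection; "Econometrics" is shown only as an existing example view, since no dedicated data-access view existed at the time of this thread:]

```r
install.packages("ctv")   # not run here: needs a network connection
library(ctv)

## Install every package listed in a given CRAN Task View
install.views("Econometrics")   # a public-data view would work the same way
```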
Re: [R] The Future of R | API to Public Databases
The whole issue is related to the mismatch of (1) the publisher of the data and (2) the user at the rendezvous point. Neither the publisher nor the user knows anything about the rendezvous point. Both want to meet but don't meet in reality. The user wastes time finding the rendezvous point defined by the publisher. The publisher assumes any rendezvous point. Given the number of publishers, the variety of fields and the flavor of each expert, we end up in today's data world. Everyone has to waste his precious time to find out the rendezvous point. Only experts know in which corner to focus their search - but even they need time to find what they want. However, each expert (of each profession) believes that his approach is the best one in the world. Finally we have a state of total confusion, where only experts can handle the information and non-experts cannot even access the data without diving fully into the flood of data and its specialities. That's my point: data is not accessible. The discussion should follow a strategic approach: - Is the classical CSV file (in all its varieties) the simplest and best way? - Isn't it the responsibility of the R community to recommend standards for different kinds of data? With the existence of this rendezvous point, the publisher would know a specific point which is favorable from the user's point of view. That is what is missing. Only a rendezvous point defined by the community can be a 'known' rendezvous point for all stakeholders, globally. I do believe that the publisher's greatest interest is data accessibility. Where is the toolkit we provide them to enable them to serve us the data exactly as we want it? Instead, we just try to build even more packages that get lost in the noise of information. I disagree with a proposed solution to have a maintained package or a bunch of packages which just combines connections to the existing databases and keeps them up to date.
It is only a question of time until the user gets lost there. Such an approach is neither feasible nor efficient. We should just tell them where we would like to meet. Benjamin On 14 January 2012 04:58, Brian Diggs dig...@ohsu.edu wrote: [...]
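Benjamin's "rendezvous point" can be read, on the R side, as a single community-agreed accessor interface that hides where and how each publisher serves its data. A minimal sketch of the idea, assuming a hypothetical `fetch_csv()` helper; the inline data stands in for what would, in practice, be a CSV document retrieved from a published URL:

```r
## Hypothetical accessor illustrating the "rendezvous point" idea:
## one uniform entry point, regardless of which publisher the data
## comes from. The function name and the sample data are invented
## for illustration; a real 'source' would be a URL agreed on by
## the community and the publisher.
fetch_csv <- function(source) {
  ## read.csv(text = ...) parses CSV from an in-memory string,
  ## which keeps this sketch runnable without network access.
  read.csv(text = source, stringsAsFactors = FALSE)
}

sample_csv <- "year,value\n2010,1.5\n2011,2.3"
d <- fetch_csv(sample_csv)
str(d)  # a plain data.frame with columns 'year' and 'value'
```

The point of the sketch is the interface, not the implementation: if publishers and users agreed on one such entry point, step 3 of Benjamin's list (downloading the right format) would collapse into a single call.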
Re: [R] The Future of R | API to Public Databases
A traditional way out of a chaotic situation like the one you describe is to try to establish a standards committee: invite participation from suppliers and users of whatever is at issue (data, in this case), apply for registration with the International Organization for Standardization, organize meetings, draft and circulate a proposed standard, and so on.

A statistician who had published perhaps 100 papers and 3 books once told me that his work on ISO 9000 (I think) made a larger contribution to humanity than anything else he had done. Work on standards is one of the most boring, tedious activities I can imagine, and yet it can be the most impactful thing one does in this life: if there is an ISO standard number for something, people starting something new may find it and follow it; people working to upgrade something may tell their management, "Let's follow this standard"; and customers sometimes tell their suppliers, "If you follow the standard, you might get more customers."

I think you could get support for such a standards effort from the American Association for the Advancement of Science, the American Economic Association, the American Statistical Association, and many other organizations, including the many online science journals that today press authors to put the data behind their published papers in the public domain, downloadable from their web sites, etc.

IMHO.

Spencer

On 1/13/2012 3:39 PM, Benjamin Weber wrote: [...]