Mauro,

Agreed 100% (I'm actually a regular R user), but given Eduardo's needs (or my understanding of them), I think R might be overkill here. I replicated and completed Eduardo's table in less than 30 minutes with OpenRefine, whereas with R it would probably take longer than that just to get the data. But again, for more serious analytics, few things (if any) beat R.
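
For anyone who would rather script it, the three OpenRefine steps quoted further down (extract the dataset UUID, build the request URL, read "count" from the JSON) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names are made up, column A is assumed to hold dataset page URLs that end in the UUID, and the parameters are written in the camelCase form (taxonKey, datasetKey) documented for the GBIF occurrence search API rather than the uppercase form in the template URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://api.gbif.org/v1/occurrence/search"
PLANTAE_KEY = 6  # taxon key for Plantae, as used throughout this thread


def dataset_uuid(dataset_url: str) -> str:
    """Step 1: keep only the last path segment of a dataset URL.

    Assumption: the value in column A ends with the dataset UUID.
    """
    return dataset_url.rstrip("/").rsplit("/", 1)[-1]


def build_query(uuid: str, taxon_key: int = PLANTAE_KEY) -> str:
    """Step 2: build the request URL; limit=1 keeps the response small."""
    params = {"taxonKey": taxon_key, "datasetKey": uuid, "limit": 1}
    return f"{API_BASE}?{urlencode(params)}"


def extract_count(response_text: str) -> int:
    """Step 3: read the total from the "count" field of the JSON response."""
    return int(json.loads(response_text)["count"])


def plant_record_count(dataset_url: str) -> int:
    """Run all three steps against the live API (needs network access)."""
    with urlopen(build_query(dataset_uuid(dataset_url))) as resp:
        return extract_count(resp.read().decode("utf-8"))
```

Calling plant_record_count() on each value of column A would give the per-dataset plant count; a result of zero would flag resources such as FishBase for removal from the list.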

Cheers!

On Wed, 09/09/2015, 22:59, Mauro Cavalcanti <maurobio at gmail.com> wrote:
>
> Javier,
>
> The problem with these tools (LontraHarvest, OpenRefine, etc.) is that
> they are just data *retrieval* tools; they provide no data analysis or
> representation functionality. One or more different tools have to be
> used after retrieving the data of interest for plotting maps, charts,
> tabulation, statistical analyses, etc.
> This is just where R excels, allowing all these operations to be
> performed in a unified, straightforward workflow.
>
> Cheers!
>
> 2015-09-09 17:23 GMT-03:00 Javier Otegui <javier.otegui at gmail.com>:
>
>> Hi Eduardo (et al.),
>>
>> If I understand correctly, the list at https://goo.gl/3wysaA shows the
>> resources with data from Brazil and you want to filter out those with
>> records other than Plants, am I right? Have you considered using OpenRefine
>> (http://openrefine.org/) for this task? OpenRefine has a service to
>> fetch URLs built from data in other columns, which plays very well
>> with the GBIF APIs. You can make the program dynamically build the API
>> request URL based on the dataset UUID, and fetch and parse the JSON
>> response, without having to download the data and with almost no coding.
>> The way I would go here is:
>>
>>    1. Create a column based on the value in column A of your table,
>>    to extract just the dataset UUID
>>    2. Create a new column fetching the GBIF API, adding the value in the
>>    previous column to a template URL:
>>    http://api.gbif.org/v1/occurrence/search?TAXON_KEY=6&limit=1&DATASET_KEY=<value>
>>    The "limit=1" part makes things faster by avoiding having to
>>    show the default 20 records in the column
>>    3. Create yet another column parsing the JSON result from the
>>    previous column, extracting just the value in the field "count".
>>    The result is the number of plant records in that dataset (therefore,
>>    resources such as FishBase will have a value of zero)
>>
>> Actually, you can add as many columns as you want, with as many API
>> calls, to fill the rest of the fields in your table. Using the "registry"
>> API, you can get the title, external data link and the protocol (IPT,
>> DiGIR...).
>>
>> Hope this helps. Let me know if you are interested in this approach and
>> need more help using OpenRefine.
>> Cheers!
>>
>> Javier Otegui
>> http://www.jotegui.com
>>
>> On Wed, Sep 9, 2015 at 8:07 PM, Mauro Cavalcanti <maurobio at gmail.com>
>> wrote:
>>
>>> Scott,
>>>
>>> That's my very point: using R and rgbif should be the best path to
>>> take in this case, both because of the easier access to the GBIF API
>>> provided by rgbif and the HUGE data analysis capabilities of R itself. I
>>> had been working on a paper discussing this in the context of conservation
>>> databases (using R/rgbif and a Red-Listed group of mammals as an example),
>>> but unfortunately this work has been delayed by unexpected health problems.
>>> I hope it can see the light someday, however.
>>>
>>> Best regards,
>>> On 09/09/2015 14:44, "Scott Chamberlain" <scott at ropensci.org> wrote:
>>>
>>>> Note that the R client rgbif does interface with the GBIF download API
>>>> in addition to the search API, making it easier to deal with larger
>>>> datasets. This works even if you downloaded bulk data from the GBIF GUI.
>>>> Ignore this if you don't use R :)
>>>>
>>>> Best, S
>>>>
>>>> On Wed, Sep 9, 2015 at 10:35 AM Alex Thompson <godfoder at acis.ufl.edu>
>>>> wrote:
>>>>
>>>>> I'm kind of seconding Rod here.
>>>>>
>>>>> It might make more sense, depending on your use case and local
>>>>> computer resources, to just get a download of Plantae *AND* Brazil from
>>>>> GBIF periodically, then process that to exclude existing Brazilian
>>>>> datasets.
>>>>> You could then use something like Apache Hadoop / Spark to
>>>>> efficiently split the file by dataset or by institution code.
>>>>>
>>>>> This would greatly simplify your interactions with GBIF (down to just
>>>>> periodically generating a download programmatically) and you would have
>>>>> an easy place to insert any additional data transformations you want.
>>>>> This is the path I take for my work at least - the incremental cost of a
>>>>> couple million more records is worth the reduction in complexity overall.
>>>>>
>>>>> - Alex
>>>>>
>>>>> On 09/09/2015 12:16 PM, Eduardo Dalcin wrote:
>>>>>
>>>>> Hi Rod,
>>>>>
>>>>> The real purpose is to have a list of UUIDs and the "source web page"
>>>>> for each dataset. Thus, one way to do it is to select those resources
>>>>> whose counts are <> 0 for PLANTAE *AND* Brazil.
>>>>>
>>>>> I don't want to do any statistical analysis, just feed a local
>>>>> harvester / aggregator.
>>>>>
>>>>> The problem is that, considering the reply from Jan Legind on Sep 3, we
>>>>> have to check them one by one (https://goo.gl/3wysaA) to see whether
>>>>> each is a Herbarium / Preserved Specimen (Plantae) dataset or not, from
>>>>> the request
>>>>> http://api.gbif.org/v1/occurrence/counts/datasets?country=BR&taxonKey=6&basisOfRecord=PRESERVED_SPECIMEN
>>>>>
>>>>> Does it make sense?
>>>>>
>>>>> Thanks for your curiosity! :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Eduardo
>>>>>
>>>>> --------------------------------
>>>>> Eduardo Dalcin <http://eduardo.dalc.in>
>>>>> Instituto de Pesquisas Jardim Botânico do Rio de Janeiro - JBRJ
>>>>> e-mail: edalcin at jbrj.gov.br
>>>>> Trabalho / Work: +55 21 3204 2116
>>>>> --------------------------------
>>>>> e-mail alternativo / alternate email: edalcin at jbrj.org
>>>>> --------------------------------
>>>>> Agendar reunião / Schedule a meeting: http://agendar.dalc.in
>>>>>
>>>>> On Mon, Sep 7, 2015 at 12:33 PM, Roderic Page <
>>>>> Roderic.Page at glasgow.ac.uk> wrote:
>>>>>
>>>>>> Hi Eduardo,
>>>>>>
>>>>>> I'm curious, is the purpose to get counts by dataset by country, or
>>>>>> to get all the plant occurrences for Brazil? The latter can be obtained
>>>>>> by downloading all plant occurrences in Brazil
>>>>>> http://www.gbif.org/occurrence/search?TAXON_KEY=6&COUNTRY=BR (you
>>>>>> could then compute the per-dataset stats locally). I realise that this
>>>>>> isn't as convenient as having GBIF slice the data for you in the API.
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Rod
>>>>>>
>>>>>> ---------------------------------------------------------
>>>>>> Roderic Page
>>>>>> Professor of Taxonomy
>>>>>> Institute of Biodiversity, Animal Health and Comparative Medicine
>>>>>> College of Medical, Veterinary and Life Sciences
>>>>>> Graham Kerr Building
>>>>>> University of Glasgow
>>>>>> Glasgow G12 8QQ, UK
>>>>>>
>>>>>> Email: Roderic.Page at glasgow.ac.uk
>>>>>> Tel: +44 141 330 4778
>>>>>> Skype: rdmpage
>>>>>> Facebook: http://www.facebook.com/rdmpage
>>>>>> LinkedIn: http://uk.linkedin.com/in/rdmpage
>>>>>> Twitter: http://twitter.com/rdmpage
>>>>>> Blog: http://iphylo.blogspot.com
>>>>>> ORCID: http://orcid.org/0000-0002-7101-9767
>>>>>> Citations:
>>>>>> http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
>>>>>> ResearchGate: https://www.researchgate.net/profile/Roderic_Page
>>>>>>
>>>>>> On 4 Sep 2015, at 10:39, Eduardo Dalcin <edalcin at jbrj.org> wrote:
>>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> Yes, it's a shame I can't use country and "nub" together. Is there
>>>>>> any hope of that?
>>>>>>
>>>>>> Eduardo
>>>>>>
>>>>>> On Thu, Sep 3, 2015 at 4:29 PM, Markus Döring <mdoering at gbif.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Eduardo,
>>>>>>>
>>>>>>> As you might have seen from my issue comment, the web service uses a
>>>>>>> different parameter name for taxonKey, which is a bug we need to fix
>>>>>>> at some point.
>>>>>>> For now, please use nubKey with the service, like this:
>>>>>>>
>>>>>>> http://api.gbif.org/v1/occurrence/counts/datasets?nubKey=6
>>>>>>>
>>>>>>> The real problem for you will be that we do not support the
>>>>>>> combination of the country and the taxon filter, just one of the two.
>>>>>>> So you cannot search for plants in Brazil, I am afraid, just for
>>>>>>> datasets about Brazil and datasets with plant records.
>>>>>>>
>>>>>>> Markus
>>>>>>>
>>>>>>> > On 03 Sep 2015, at 14:12, Eduardo Dalcin <edalcin at jbrj.org> wrote:
>>>>>>> >
>>>>>>> > Thanks Jan. I'll keep exploring and I'll be in touch if I need to.
>>>>>>> >
>>>>>>> > Best,
>>>>>>> >
>>>>>>> > Eduardo
>>>>>>> >
>>>>>>> > On Thu, Sep 3, 2015 at 4:51 AM, Jan Legind [GBIF] <jlegind at gbif.org> wrote:
>>>>>>> > Dear Eduardo,
>>>>>>> >
>>>>>>> > Thanks for getting in touch with us about these issues.
>>>>>>> >
>>>>>>> > The first request
>>>>>>> > http://api.gbif.org/v1/occurrence/count?country=BR&taxonKey=6&basisOfRecord=PRESERVED_SPECIMEN
>>>>>>> > returns the number of records located in Brazil for the facets in the
>>>>>>> > request.
>>>>>>> >
>>>>>>> > The second query
>>>>>>> > http://api.gbif.org/v1/occurrence/counts/datasets?country=BR&taxonKey=6&basisOfRecord=PRESERVED_SPECIMEN
>>>>>>> > uses the Occurrence Inventories web service
>>>>>>> > http://www.gbif.org/developer/occurrence#inventories which does not
>>>>>>> > support the basis-of-record facet in the /datasets request. I understand
>>>>>>> > that it would be better if the API response yielded an error message in
>>>>>>> > this instance.
>>>>>>> >
>>>>>>> > Concerning the other issues: you are indeed right that the counts
>>>>>>> > do not make sense in the context of taxon key 6, which is Plantae.
>>>>>>> > Actually the API does not handle the taxonKey search at all, contrary
>>>>>>> > to what the documentation states:
>>>>>>> >
>>>>>>> >   /occurrence/counts/datasets
>>>>>>> >   GET
>>>>>>> >   Counts
>>>>>>> >   Lists occurrence counts for datasets that cover a given taxon or country.
>>>>>>> >   country, taxonKey
>>>>>>> >
>>>>>>> > As you can see here,
>>>>>>> > http://api.gbif.org/v1/occurrence/counts/datasets?taxonKey=6 , this
>>>>>>> > request doesn't return anything.
>>>>>>> >
>>>>>>> > The GBIF developers will handle this issue in due time.
>>>>>>> >
>>>>>>> > You can follow the issue in our bug tracking service here:
>>>>>>> > http://dev.gbif.org/issues/browse/POR-2828
>>>>>>> >
>>>>>>> > With best regards,
>>>>>>> >
>>>>>>> > Jan K. Legind
>>>>>>> >
>>>>>>> > Data manager, GBIF Secretariat
>>>>>>> >
>>>>>>> > From: API-users [mailto:api-users-bounces at lists.gbif.org] On
>>>>>>> > Behalf Of Eduardo Dalcin
>>>>>>> > Sent: 2. september 2015 20:06
>>>>>>> > To: api-users at lists.gbif.org; dev at gbif.org
>>>>>>> > Cc: João Monnerat Lanna; Natália Queiroz; Diogo Silva; Laura;
>>>>>>> > Ricardo Avancini
>>>>>>> > Subject: [API-users] Some questions from a begginer
>>>>>>> >
>>>>>>> > Hi folks,
>>>>>>> >
>>>>>>> > This is my first message to the list. So, please, be nice :)
>>>>>>> >
>>>>>>> > I'm working here at the Rio de Janeiro Botanical Garden, together with
>>>>>>> > the guys at the National Center for Flora Conservation. We are doing the
>>>>>>> > risk assessment of the Brazilian flora for the government. So far we have
>>>>>>> > assessed the risk of ca. 6,000 species, but we still have to assess
>>>>>>> > ca. 35,000.
>>>>>>> > Access to occurrence records for Brazil is crucial, and every
>>>>>>> > occurrence is important.
>>>>>>> >
>>>>>>> > That means that we have to put together occurrence data from
>>>>>>> > different sources and, after the first batch of the risk assessment, we
>>>>>>> > realized that we need to build our own aggregator. We are planning to do
>>>>>>> > this with the Lontra-harvester, with the help of the guys at the
>>>>>>> > Brazilian GBIF Node.
>>>>>>> >
>>>>>>> > So, one of the first steps was to list the available
>>>>>>> > resources to understand the dimension of the task, and that brings me
>>>>>>> > to my questions.
>>>>>>> >
>>>>>>> > First:
>>>>>>> >
>>>>>>> > The request:
>>>>>>> >
>>>>>>> > http://api.gbif.org/v1/occurrence/count?country=BR&taxonKey=6&basisOfRecord=PRESERVED_SPECIMEN
>>>>>>> >
>>>>>>> > returns 4,982,689 records
>>>>>>> >
>>>>>>> > And the request:
>>>>>>> >
>>>>>>> > http://api.gbif.org/v1/occurrence/counts/datasets?country=BR&taxonKey=6&basisOfRecord=PRESERVED_SPECIMEN
>>>>>>> >
>>>>>>> > returns (here) 7,406,310 records
>>>>>>> >
>>>>>>> > Comments?
>>>>>>> >
>>>>>>> > Second:
>>>>>>> >
>>>>>>> > The request:
>>>>>>> >
>>>>>>> > http://api.gbif.org/v1/occurrence/count?country=BR&taxonKey=6&basisOfRecord=PRESERVED_SPECIMEN
>>>>>>> >
>>>>>>> > returns things like this:
>>>>>>> >
>>>>>>> > "197908d0-5565-11d8-b290-b8a03c50a862":27629
>>>>>>> >
>>>>>>> > But querying the same dataset:
>>>>>>> >
>>>>>>> > http://www.gbif.org/occurrence/search?TAXON_KEY=6&DATASET_KEY=197908d0-5565-11d8-b290-b8a03c50a862
>>>>>>> >
>>>>>>> > returns "null" (of course, it is FishBase!)
>>>>>>> >
>>>>>>> > I have plenty of examples like this, highlighted in yellow here (not
>>>>>>> > finished!):
>>>>>>> >
>>>>>>> > https://docs.google.com/spreadsheets/d/1msUjwMLoKwnXxJFzF20SeN_C65RIkGLbwaYyj459VTc/edit?usp=sharing
>>>>>>> >
>>>>>>> > Comments?
>>>>>>> >
>>>>>>> > I think those two questions are a good start. Please let me know
>>>>>>> > if I'm doing something wrong.
>>>>>>> >
>>>>>>> > Cheers,
>>>>>>> >
>>>>>>> > Eduardo
>>>>>>>
>>>>>> _______________________________________________
>>>>>> API-users mailing list
>>>>>> API-users at lists.gbif.org
>>>>>> http://lists.gbif.org/mailman/listinfo/api-users
>
> --
> Dr. Mauro J. Cavalcanti
> E-mail: maurobio at gmail.com
> Web: http://sites.google.com/site/maurobio
