Hello,

First of all, I would like to congratulate the team behind data.gov.in for the effort to create this website. I hope this will lead to a bloom of data-driven research in governance. It will also serve well as a starting point for Computer Science or Statistics graduates looking for projects.

Given how valuable this data is, I would like to provide my inputs after walking through the website and attempting to process some data-sets. I should add that my point of view is entirely that of an engineer and software integrator.

1) Clear explanation of fields (Issue of lexicon)

In several files, the data contains abbreviations which have not been fully explained. MGNREGA is quite a popular scheme, but it stands as an exception. Compressing terminology is easy, even essential, when you work with the same individuals day to day. But an outsider who plans to use this data will need to visit the concerned department or research on the Internet to understand what the data means. She might as well go and collect the data from the concerned department in the first place.

2) Question of relations

Several data-sets are related to each other. To illustrate, "Summary Of Railway Statistics From 2002-03 To 2010-11" is a kind of aggregate parent of "Number Of Persons Killed And Injured In Railway Related Accidents From 2002-03 To 2010-11". But this relation has to be figured out by the consumer of the website herself. The site does not help establish that relation by default. You could certainly filter data-sets by ministry, but that does not necessarily give you related data.

3) Question of data dimensions

Excel is a great data tool. In fact, I would go further and say that for several people it may be their first introduction to programming. Sadly, Excel does not help one think through data dimensions. That is something the user has to do herself. To illustrate with a (fictitious) example:

                            Companies registered
  Delhi, 2011               40
  Arunachal Pradesh, 2011   10
  Arunachal Pradesh, 2012   20

It is clear that the third dimension, which is the year, has been compressed. This requires extra effort from the data consumer to clean the data.
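
As a rough sketch of the clean-up this implies, the following Python snippet pulls the year back out into its own field. It assumes the layout of the fictitious example above, where each row label has the form "State, Year" and the year is whatever follows the last comma. The field names (state, year, companies_registered) and the helper names are made up for illustration and do not come from any data.gov.in file.

```python
# Fictitious rows from the example above: a "State, Year" label and a count.
raw_rows = [
    ("Delhi, 2011", 40),
    ("Arunachal Pradesh, 2011", 10),
    ("Arunachal Pradesh, 2012", 20),
]


def split_label(label):
    """Split 'State, Year' into (state, year), using the last comma as separator."""
    state, year = label.rsplit(",", 1)
    return state.strip(), int(year)


def to_long(rows):
    """Return one record per row, with state and year as separate fields."""
    records = []
    for label, count in rows:
        state, year = split_label(label)
        records.append(
            {"state": state, "year": year, "companies_registered": count}
        )
    return records


def pivot(records):
    """Build {state: {year: count}}; combinations with no row are simply absent."""
    table = {}
    for r in records:
        table.setdefault(r["state"], {})[r["year"]] = r["companies_registered"]
    return table


tidy = to_long(raw_rows)
for record in tidy:
    print(record)
```

Once the year is a field of its own, it becomes obvious that Delhi has no row for 2012, which is hard to spot while the year is hidden inside the label.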

It would be helpful if the team ran some preliminary checks on the data for these logical follies.

4) Non availability of data

Not applicable is different from not available, which is different from zero, which in turn is different from empty. In several cases the data points have been marked as NA. What does NA mean in this context? We could assume several things, namely:

- Data is not available
- Data is not applicable
- Data is zero
- It is empty data

These are different from each other in sometimes subtle and sometimes not so subtle ways. I think the data should have clear labelling of these four types.

5) Certificate issue (important)

There are several file format options on each data-set. Except for Excel, no other format is usable. I would go further and say that people in the data.gov.in team have not tested the other formats at all. The certificate for https is not valid. The NIC root certificate is not recognized by any browser. However hard anyone tells me to, I will not install a certificate just because people at NIC have been too lazy to get their certificate included in all the browsers. In several cases it is not even possible to install a certificate, for example when someone uses a tablet to visit the site. Additionally, browsers are not the only http clients. I use R, and I could import data into R directly if a valid certificate existed. Here is an example of how my R session went with data.gov.in data:

  R version 2.15.3 (2013-03-01) -- "Security Blanket"
  Copyright (C) 2013 The R Foundation for Statistical Computing
  ISBN 3-900051-07-0
  Platform: i686-pc-linux-gnu (32-bit)

  > require("RCurl")
  Loading required package: RCurl
  Loading required package: bitops
  > read.table(textConnection(getURL("https://datacms.nic.in/datatool/?url=http://www.data.gov.in//sites/default/files/DETAILS_OF_GROSS_TRAFFIC_EARNINGS_1.XLS&format=jsonp")))
  Error in function (type, msg, asError = TRUE) :
    SSL certificate problem, verify that the CA cert is OK.
    Details: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
  >

R, along with SAS and SPSS, will be the primary tools for processing this data, I assume. I think comprehensive testing should be done to ensure that access to the data via these tools is flawless.

I am sure the team at data.gov.in will address at least some of the issues immediately. I wish them the best in their endeavour.

--
Supreet Sethi
Ph IN: +919811143517
Ph Skype: d_j_i_n_n
Profile: http://www.google.com/profiles/supreet.sethi
Twt: http://twitter.com/djinn

_______________________________________________
Ilugd mailing list
Ilugd@lists.linux-delhi.org
http://frodo.hserus.net/mailman/listinfo/ilugd