Possibly of interest:
Race and ethnicity Imputation from Disease history with Deep LEarning
https://github.com/jisungk/riddle
Bill
On 7/6/17 6:00 PM, Bill Ross wrote:
Unless the data concretely promotes discrimination, it seems
discriminatory to exclude it.
Bill
On 7/6/17 5:39 PM, Sebastian Raschka wrote:
I think there can be some middle ground. I.e., adding a new, simple
dataset to demonstrate regression (maybe auto MPG, wine quality, or
something like that), using that for the scikit-learn examples in
the main documentation etc., but leaving the Boston dataset in the
code base for now. Whether it's a weak argument or not, it would be
quite disruptive to remove the dataset altogether in the next
version or so, not only because old tutorials use it but also
because many unit tests in many different projects depend on it. I
think it might be better to phase it out by having a good
alternative first, and I am sure that the scikit-learn maintainers
wouldn't have anything against it if someone updated the
examples/tutorials to use different datasets.
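As a rough sketch of what swapping the examples over could look like
(using fetch_california_housing purely as a stand-in choice, since
scikit-learn already ships a loader for it; auto MPG or wine quality
would have to be fetched externally):

    # Minimal sketch: the same kind of regression example, on a dataset
    # scikit-learn already ships. California housing is a stand-in here,
    # not a decided replacement.
    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    housing = fetch_california_housing()
    X_train, X_test, y_train, y_test = train_test_split(
        housing.data, housing.target, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))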
Best,
Sebastian
On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias <jni.s...@gmail.com>
wrote:
For what it's worth: I'm sympathetic to the argument that you can't
fix the problem if you don't measure it, but I agree with Tony that
"many tutorials use it" is an extremely weak argument. We removed
Lena from scikit-image because it was the right thing to do. I very
much doubt that Boston house prices is in more widespread use than
Lena was in image processing.
You can argue about whether or not it's morally right or wrong to
include the dataset. I see merit to both arguments. But "too many
tutorials use it" is very similar in flavour to "the economy of the
South would collapse without slavery."
Regarding fair uses of the feature, I would hope that all sklearn
tutorials using the dataset mention such uses. The potential for
abuse and misinterpretation is enormous.
On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
<jmschreibe...@gmail.com> wrote:
Hi Tony
As others have pointed out, I think that you may be
misunderstanding the purpose of that "feature." We are in agreement
that discrimination against protected classes is not OK, and that
even outside complying with the law one should avoid
discrimination, in model building or elsewhere. However, I disagree
that one does this by eliminating from all datasets any feature
that may allude to these protected classes. As Andreas pointed out,
there is a growing effort to ensure that machine learning models
are fair and benefit the common good (such as FATML, DSSG, etc.),
and from my understanding the general consensus isn't necessarily
that simply eliminating the feature is sufficient. I think we are
in agreement that naively learning a model over a feature set
containing questionable features and calling it a day is not okay,
but as others have pointed out, having these features present and
handling them appropriately can help guard against the model
implicitly learning unfair biases (even if they are not explicitly
exposed to the feature).
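To make that concrete, here is a minimal sketch of one such check
(just an illustration, not a full fairness audit): fit the model
without the sensitive column, then ask whether that column still
predicts the model's errors.

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    # Train without the sensitive column 'B', then check whether it
    # still explains the model's errors.
    boston = load_boston()
    b_idx = list(boston.feature_names).index("B")
    X = np.delete(boston.data, b_idx, axis=1)
    b_col = boston.data[:, b_idx]

    model = LinearRegression().fit(X, boston.target)
    residuals = boston.target - model.predict(X)

    # A sizable correlation here would suggest the errors still vary
    # with the excluded feature, i.e. the model has not become "fair"
    # simply by dropping the column.
    print("corr(B, residuals):", np.corrcoef(b_col, residuals)[0, 1])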
I would welcome the addition of the Ames dataset to the ones
supported by sklearn, but I'm not convinced that the Boston dataset
should be removed. As Andreas pointed out, there is a benefit to
having canonical examples present so that beginners can easily
follow along with the many tutorials that have been written using
them. As Sean points out, the paper itself is trying to pull out
the connection between house price and clean air in the presence of
possible confounding variables. In a more general sense, saying
that a feature shouldn't be there because a simple linear
regression is unaffected by it is a bit odd, because it is very
common for datasets to include irrelevant features, and handling
them appropriately is important. In addition, one could
argue that having this type of issue arise in a toy dataset has a
benefit because it exposes these types of issues to those learning
data science earlier on and allows them to keep these issues in
mind in the future when the data is more serious.
It is important for us all to keep issues of fairness in mind when
it comes to data science. I'm glad that you're speaking out in
favor of fairness and trying to bring attention to it.
Jacob
On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
<sean.viola...@gmail.com> wrote:
G Reina,
You make a bizarre argument. You argue that one should not even
check racism as a possible factor in house prices? But then you
yourself check whether it's relevant.
Then you say
"but I'd argue that it's more due to the location (near water, near
businesses, near restaurants, near parks and recreation) than to
the ethnic makeup"
Which was basically what the original authors wanted to show too:
Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for
clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
but unless you measure ethnic make-up you cannot show that it is
not a confounder.
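A quick illustrative sketch of that point: compare the coefficient
on NOX (air pollution) with and without the B column in the model.
If the coefficient moves substantially, B was confounding; if it
barely moves, that is evidence it was not.

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    names = list(boston.feature_names)
    nox_idx, b_idx = names.index("NOX"), names.index("B")

    # Coefficient on air pollution with the full feature set...
    full = LinearRegression().fit(boston.data, boston.target)

    # ...versus with the ethnicity-derived column removed.
    reduced = LinearRegression().fit(
        np.delete(boston.data, b_idx, axis=1), boston.target)

    print("NOX coefficient, full model:", full.coef_[nox_idx])
    print("NOX coefficient, B removed: ",
          reduced.coef_[nox_idx if nox_idx < b_idx else nox_idx - 1])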
The term "white flight" refers to affluent white families moving to
the suburbs. And clearly a question is whether/how much of that was
racism versus avoiding air pollution.
On 6 Jul 2017 6:10 pm, "G Reina" <gre...@eng.ucsd.edu> wrote:
I'd like to request that the "Boston Housing Prices" dataset in
sklearn (sklearn.datasets.load_boston) be replaced with the "Ames
Housing Prices" dataset
(https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
willing to submit the code change if the developers agree.
The Boston dataset has the feature "Bk is the proportion of blacks
in town". It is an incredibly racist "feature" to include in any
dataset. I think it is beneath us as data scientists.
I submit that the Ames dataset is a viable alternative for learning
regression. The author has shown that the dataset is a more robust
replacement for Boston. Ames is a 2011 regression dataset on
housing prices with more than 5 times the number of training
examples and over 7 times as many features (none of which are
morally questionable).
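Until a proper loader exists, something like the following sketch
could stand in. Note the assumptions: it pulls the data from OpenML
via fetch_openml, and the dataset name "house_prices" is an assumed
identifier for the Ames data, not a confirmed one.

    # Sketch only: load the Ames data from OpenML and fit a baseline.
    # The name "house_prices" is an assumption; adjust if it differs.
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    ames = fetch_openml(name="house_prices", as_frame=True)
    X = ames.data.select_dtypes("number").fillna(0)  # numeric-only, for brevity
    y = ames.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = Ridge().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))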
I welcome the community's thoughts on the matter.
Thanks.
-Tony
Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn