And, more to the point, the discussion on Reddit:
https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/
Bill
On 7/9/17 5:13 PM, Bill Ross wrote:
Possibly of interest:
Race and ethnicity Imputation from Disease history with Deep LEarning
https://github.com/jisungk/riddle
Bill
On 7/6/17 6:00 PM, Bill Ross wrote:
Unless the data concretely promotes discrimination, it seems
discriminatory to exclude it.
Bill
On 7/6/17 5:39 PM, Sebastian Raschka wrote:
I think there can be some middle ground: adding a new, simple
dataset to demonstrate regression (maybe auto-mpg, wine quality, or
something like that), using it for the scikit-learn examples in the
main documentation etc., but leaving the Boston dataset in the code base
for now. Whether it's a weak argument or not, it would be quite
destructive to remove the dataset altogether in the next version or
so, not only because old tutorials use it but because many unit tests in
many different projects depend on it. I think it might be better to
phase it out by having a good alternative first, and I am sure that
the scikit-learn maintainers wouldn't have anything against it if
someone updated the examples/tutorials to use different datasets.
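A minimal sketch of what an updated docs example could look like, using the built-in diabetes loader as a stand-in (whatever replacement dataset is eventually chosen, this is just illustrative):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Any small regression dataset works here; diabetes already ships with scikit-learn.
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    reg = LinearRegression().fit(X_train, y_train)
    print("held-out R^2:", reg.score(X_test, y_test))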
Best,
Sebastian
On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias
<jni.s...@gmail.com> wrote:
For what it's worth: I'm sympathetic to the argument that you can't
fix the problem if you don't measure it, but I agree with Tony that
"many tutorials use it" is an extremely weak argument. We removed
Lena from scikit-image because it was the right thing to do. I very
much doubt that the Boston house-prices dataset is in more widespread
use than Lena was in image processing.
You can argue about whether or not it's morally right or wrong to
include the dataset. I see merit to both arguments. But "too many
tutorials use it" is very similar in flavour to "the economy of the
South would collapse without slavery."
Regarding fair uses of the feature, I would hope that all sklearn
tutorials using the dataset mention such uses. The potential for
abuse and misinterpretation is enormous.
On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
<jmschreibe...@gmail.com>, wrote:
Hi Tony
As others have pointed out, I think that you may be
misunderstanding the purpose of that "feature." We are in
agreement that discrimination against protected classes is not OK,
and that even outside complying with the law one should avoid
discrimination, in model building or elsewhere. However, I
disagree that one does this by eliminating from all datasets any
feature that may allude to these protected classes. As Andreas
pointed out, there is a growing effort to ensure that machine
learning models are fair and benefit the common good (such as
FATML, DSSG, etc.), and from my understanding the general
consensus isn't necessarily that simply eliminating the feature is
sufficient. I think we are in agreement that naively learning a
model over a feature set containing questionable features and
calling it a day is not okay, but as others have pointed out,
having these features present and handling them appropriately can
help guard against the model implicitly learning unfair biases (even
if they are not explicitly exposed to the feature).
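A minimal sketch of that "keep the column for auditing, not for training" idea, using the Boston data itself (purely illustrative; real fairness auditing needs far more care than a single correlation):

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    boston = load_boston()
    names = list(boston.feature_names)
    b_col = names.index('B')

    X = np.delete(boston.data, b_col, axis=1)   # the model never sees the column
    sensitive = boston.data[:, b_col]           # but we keep it around for the audit
    y = boston.target

    # If out-of-sample errors still track the held-out column, the model may be
    # picking the information up implicitly through correlated features.
    pred = cross_val_predict(LinearRegression(), X, y, cv=5)
    r, p = pearsonr(y - pred, sensitive)
    print("residuals vs. held-out column: r=%.3f, p=%.3g" % (r, p))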
I would welcome the addition of the Ames dataset to the ones
supported by sklearn, but I'm not convinced that the Boston
dataset should be removed. As Andreas pointed out, there is a
benefit to having canonical examples present so that beginners can
easily follow along with the many tutorials that have been written
using them. As Sean points out, the paper itself is trying to pull
out the connection between house price and clean air in the
presence of possible confounding variables. In a more general
sense, saying that a feature shouldn't be there because a simple
linear regression is unaffected by the results is a bit odd
because it is very common for datasets to include irrelevant
features, and handling them appropriately is important. In
addition, one could argue that having this type of issue arise in
a toy dataset has a benefit because it exposes these types of
issues to those learning data science earlier on and allows them
to keep these issues in mind in the future when the data is more
serious.
It is important for us all to keep issues of fairness in mind when
it comes to data science. I'm glad that you're speaking out in
favor of fairness and trying to bring attention to it.
Jacob
On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
<sean.viola...@gmail.com> wrote:
G Reina,
you make a bizarre argument. You argue that one should not even
check whether racism is a possible factor in house prices?
But then you yourself check whether it's relevant.
Then you say
"but I'd argue that it's more due to the location (near water,
near businesses, near restaurants, near parks and recreation) than
to the ethnic makeup"
Which was basically what the original authors wanted to show too
(Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand
for clean air', J. Environ. Economics & Management, vol. 5, 81-102,
1978). But unless you measure ethnic make-up, you cannot show that it
is not a confounder.
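A rough sketch of that check, comparing the pollution coefficient with and without the column present (illustrative only, not a proper confounding analysis):

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    names = list(boston.feature_names)
    nox = names.index('NOX')    # air-pollution feature
    b_col = names.index('B')

    full = LinearRegression().fit(boston.data, boston.target)
    reduced = LinearRegression().fit(np.delete(boston.data, b_col, axis=1),
                                     boston.target)

    # Column indices shift once B is removed, so recompute NOX's position.
    nox_reduced = [n for n in names if n != 'B'].index('NOX')
    print("NOX coefficient with B column:   ", full.coef_[nox])
    print("NOX coefficient without B column:", reduced.coef_[nox_reduced])

If the coefficient of interest moves substantially between the two fits, the omitted column was doing confounding work, which is exactly why you need to have measured it in the first place.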
The term "white flight" refers to affluent white families moving
to the suburbs.. And clearly a question is whether/how much was
racism or avoiding air pollution.
On 6 Jul 2017 6:10 pm, "G Reina" <gre...@eng.ucsd.edu> wrote:
I'd like to request that the "Boston Housing Prices" dataset in
sklearn (sklearn.datasets.load_boston) be replaced with the "Ames
Housing Prices" dataset
(https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
willing to submit the code change if the developers agree.
The Boston dataset has the feature "Bk is the proportion of blacks
in town". It is an incredibly racist "feature" to include in any
dataset. I think it is beneath us as data scientists.
I submit that the Ames dataset is a viable alternative for
learning regression. The author has shown that the dataset is a
more robust replacement for Boston. Ames is a 2011 regression
dataset on housing prices and has more than 5 times the number of
training examples and over 7 times as many features (none of
which are morally questionable).
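If an Ames loader were wanted, a rough sketch of how the data could be pulled in, assuming the OpenML copy named "house_prices" is the Ames data and that a scikit-learn version providing fetch_openml is installed (both are assumptions, not something the project ships as part of this proposal):

    from sklearn.datasets import fetch_openml

    # "house_prices" is assumed to be the OpenML name for the Ames data.
    X, y = fetch_openml(name="house_prices", as_frame=True, return_X_y=True)
    print(X.shape)   # several times larger than Boston in both rows and columns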
I welcome the community's thoughts on the matter.
Thanks.
-Tony
Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn