I think there can be some middle ground, i.e., adding a new, simple dataset to
demonstrate regression (maybe auto-mpg, wine quality, or something like that)
and using that for the scikit-learn examples in the main documentation, while
leaving the Boston dataset in the code base for now. Whether it's a weak
argument or not, it would be quite disruptive to remove the dataset altogether
in the next version or so, not only because old tutorials use it but because
many unit tests in many different projects depend on it. I think it would be
better to phase it out by having a good alternative first, and I am sure that
the scikit-learn maintainers wouldn't have anything against it if someone
updated the examples/tutorials to use different datasets.
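For instance, a minimal sketch of what such a replacement example could look
like, here using the bundled diabetes dataset as a stand-in (auto-mpg or wine
quality would first need to be fetched from an external source, so the exact
loader is an open question):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A regression dataset that ships with scikit-learn and has no
# ethically questionable features: 442 samples, 10 numeric features.
X, y = load_diabetes(return_X_y=True)

# Standard train/test split, as in the existing regression tutorials.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit an ordinary least-squares model and report held-out R^2.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data: %.3f" % model.score(X_test, y_test))
```

Any dataset with a similar (X, y) interface would slot into the existing
tutorials with essentially no other changes.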

Best,
Sebastian

> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:
> 
> For what it's worth: I'm sympathetic to the argument that you can't fix the 
> problem if you don't measure it, but I agree with Tony that "many tutorials 
> use it" is an extremely weak argument. We removed Lena from scikit-image 
> because it was the right thing to do. I very much doubt that Boston house 
> prices is in more widespread use than Lena was in image processing.
> 
> You can argue about whether or not it's morally right or wrong to include the 
> dataset. I see merit to both arguments. But "too many tutorials use it" is 
> very similar in flavour to "the economy of the South would collapse without 
> slavery."
> 
> Regarding fair uses of the feature, I would hope that all sklearn tutorials 
> using the dataset mention such uses. The potential for abuse and 
> misinterpretation is enormous.
> 
> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber <jmschreibe...@gmail.com>, 
> wrote:
>> Hi Tony
>> 
>> As others have pointed out, I think that you may be misunderstanding the 
>> purpose of that "feature." We are in agreement that discrimination against 
>> protected classes is not OK, and that even outside complying with the law 
>> one should avoid discrimination, in model building or elsewhere. However, I 
>> disagree that one does this by eliminating from all datasets any feature 
>> that may allude to these protected classes. As Andreas pointed out, there is 
>> a growing effort to ensure that machine learning models are fair and benefit 
>> the common good (such as FATML, DSSG, etc.), and from my understanding the 
>> general consensus isn't necessarily that simply eliminating the feature is 
>> sufficient. I think we are in agreement that naively learning a model over a 
>> feature set containing questionable features and calling it a day is not 
>> okay, but as others have pointed out, having these features present and 
>> handling them appropriately can help guard against the model implicitly 
>> learning unfair biases (even if they are not explicitly exposed to the 
>> feature). 
>> 
>> I would welcome the addition of the Ames dataset to the ones supported by 
>> sklearn, but I'm not convinced that the Boston dataset should be removed. As 
>> Andreas pointed out, there is a benefit to having canonical examples present 
>> so that beginners can easily follow along with the many tutorials that have 
>> been written using them. As Sean points out, the paper itself is trying to 
>> pull out the connection between house price and clean air in the presence of 
>> possible confounding variables. In a more general sense, saying that a 
>> feature shouldn't be there because a simple linear regression is unaffected 
>> by the results is a bit odd because it is very common for datasets to 
>> include irrelevant features, and handling them appropriately is important. 
>> In addition, one could argue that having this type of issue arise in a toy 
>> dataset has a benefit because it exposes these types of issues to those 
>> learning data science earlier on and allows them to keep these issues in 
>> mind in the future when the data is more serious. 
>> 
>> It is important for us all to keep issues of fairness in mind when it comes 
>> to data science. I'm glad that you're speaking out in favor of fairness and 
>> trying to bring attention to it. 
>> 
>> Jacob
>> 
>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante <sean.viola...@gmail.com> 
>> wrote:
>> G Reina, 
>> you make a bizarre argument. You argue that one should not even check racism 
>> as a possible factor in house prices? 
>> 
>> But then you yourself check whether it's relevant. Then you say: 
>> 
>> "but I'd argue that it's more due to the location (near water, near 
>> businesses, near restaurants, near parks and recreation) than to the ethnic 
>> makeup" 
>> 
>> which was basically what the original authors wanted to show too:
>> 
>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean 
>> air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
>> 
>> But unless you measure the ethnic make-up, you cannot show that it is not a 
>> confounder. 
>> 
>> The term "white flight" refers to affluent white families moving to the 
>> suburbs. And clearly one question is whether, or how much, that was racism 
>> versus avoiding air pollution. 
>> 
>> On 6 Jul 2017 6:10 pm, "G Reina" <gre...@eng.ucsd.edu> wrote:
>> I'd like to request that the "Boston Housing Prices" dataset in sklearn 
>> (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" 
>> dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am 
>> willing to submit the code change if the developers agree.
>> 
>> The Boston dataset has the feature "Bk is the proportion of blacks in town". 
>> It is an incredibly racist "feature" to include in any dataset. I think it 
>> is beneath us as data scientists.
>> 
>> I submit that the Ames dataset is a viable alternative for learning 
>> regression. The author has shown that the dataset is a more robust 
>> replacement for Boston. Ames is a 2011 regression dataset on housing prices 
>> and has more than 5 times the amount of training examples with over 7 times 
>> as many features (none of which are morally questionable).
>> 
>> I welcome the community's thoughts on the matter.
>> 
>> Thanks.
>> -Tony
>> 
>> Here's an article I wrote on the Boston dataset:
>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
>> 
>> 
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
