Possibly of interest:
Race and ethnicity Imputation from Disease history with Deep LEarning
https://github.com/jisungk/riddle
Bill
On 7/6/17 6:00 PM, Bill Ross wrote:
Unless the data concretely promotes discrimination, it seems
discriminatory to exclude it.
Bill
On 7/6/17 5:39 PM, Sebastian Raschka wrote:
I think there can be some middle ground. I.e., adding a new, simple
dataset to demonstrate regression (maybe auto MPG, wine quality, or
something like that), using that for the scikit-learn examples in
the main documentation etc., but leaving the Boston dataset in the
code base for now. Whether it's a weak argument or not, it would be
quite disruptive to remove the dataset altogether in the next
version or so, not only because old tutorials use it but also
because many unit tests in many different projects depend on it. I
think it might be better to phase it out by having a good
alternative first, and I am sure that the scikit-learn maintainers
wouldn't have anything against it if someone updated the
examples/tutorials to use different datasets.
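As a rough sketch of what swapping the examples over could look like
(using fetch_california_housing purely as a stand-in choice, since
scikit-learn already ships a loader for it; auto MPG or wine quality
would have to be fetched externally):

    # Minimal sketch: the same kind of regression example, on a dataset
    # scikit-learn already ships. California housing is a stand-in here,
    # not a decided replacement.
    from sklearn.datasets import fetch_california_housing
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    housing = fetch_california_housing()
    X_train, X_test, y_train, y_test = train_test_split(
        housing.data, housing.target, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))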
Best,
Sebastian
On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias <jni.s...@gmail.com>
wrote:
For what it's worth: I'm sympathetic to the argument that you can't
fix the problem if you don't measure it, but I agree with Tony that
"many tutorials use it" is an extremely weak argument. We removed
Lena from scikit-image because it was the right thing to do. I very
much doubt that Boston house prices is in more widespread use than
Lena was in image processing.
You can argue about whether or not it's morally right or wrong to
include the dataset. I see merit to both arguments. But "too many
tutorials use it" is very similar in flavour to "the economy of the
South would collapse without slavery."
Regarding fair uses of the feature, I would hope that all sklearn
tutorials using the dataset mention such uses. The potential for
abuse and misinterpretation is enormous.
On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber
<jmschreibe...@gmail.com> wrote:
Hi Tony
As others have pointed out, I think that you may be
misunderstanding the purpose of that "feature." We are in agreement
that discrimination against protected classes is not OK, and that
even outside complying with the law one should avoid
discrimination, in model building or elsewhere. However, I disagree
that one does this by eliminating from all datasets any feature
that may allude to these protected classes. As Andreas pointed out,
there is a growing effort to ensure that machine learning models
are fair and benefit the common good (such as FATML, DSSG, etc.),
and from my understanding the general consensus isn't necessarily
that simply eliminating the feature is sufficient. I think we are
in agreement that naively learning a model over a feature set
containing questionable features and calling it a day is not okay,
but as others have pointed out, having these features present and
handling them appropriately can help guard against the model
implicitly learning unfair biases (even if they are not explicitly
exposed to the feature).
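To make that concrete, here is a minimal sketch of one such check
(just an illustration, not a full fairness audit): fit the model
without the sensitive column, then ask whether that column still
predicts the model's errors.

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    # Train without the sensitive column 'B', then check whether it
    # still explains the model's errors.
    boston = load_boston()
    b_idx = list(boston.feature_names).index("B")
    X = np.delete(boston.data, b_idx, axis=1)
    b_col = boston.data[:, b_idx]

    model = LinearRegression().fit(X, boston.target)
    residuals = boston.target - model.predict(X)

    # A sizable correlation here would suggest the errors still vary
    # with the excluded feature, i.e. the model has not become "fair"
    # simply by dropping the column.
    print("corr(B, residuals):", np.corrcoef(b_col, residuals)[0, 1])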
I would welcome the addition of the Ames dataset to the ones
supported by sklearn, but I'm not convinced that the Boston dataset
should be removed. As Andreas pointed out, there is a benefit to
having canonical examples present so that beginners can easily
follow along with the many tutorials that have been written using
them. As Sean points out, the paper itself is trying to pull out
the connection between house price and clean air in the presence of
possible confounding variables. In a more general sense, saying
that a feature shouldn't be there because a simple linear
regression is unaffected by it is a bit odd, because it is very
common for datasets to include irrelevant features, and handling
them appropriately is important. In addition, one could
argue that having this type of issue arise in a toy dataset has a
benefit because it exposes these types of issues to those learning
data science earlier on and allows them to keep these issues in
mind in the future when the data is more serious.
It is important for us all to keep issues of fairness in mind when
it comes to data science. I'm glad that you're speaking out in
favor of fairness and trying to bring attention to it.
Jacob
On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante
<sean.viola...@gmail.com> wrote:
G Reina,
You make a bizarre argument. You argue that one should not even
check racism as a possible factor in house prices? But then you
yourself check whether it's relevant.
Then you say
"but I'd argue that it's more due to the location (near water, near
businesses, near restaurants, near parks and recreation) than to
the ethnic makeup"
Which was basically what the original authors wanted to show too:
Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for
clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
but unless you measure ethnic make-up you cannot show that it is
not a confounder.
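A quick illustrative sketch of that point: compare the coefficient
on NOX (air pollution) with and without the B column in the model.
If the coefficient moves substantially, B was confounding; if it
barely moves, that is evidence it was not.

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    names = list(boston.feature_names)
    nox_idx, b_idx = names.index("NOX"), names.index("B")

    # Coefficient on air pollution with the full feature set...
    full = LinearRegression().fit(boston.data, boston.target)

    # ...versus with the ethnicity-derived column removed.
    reduced = LinearRegression().fit(
        np.delete(boston.data, b_idx, axis=1), boston.target)

    print("NOX coefficient, full model:", full.coef_[nox_idx])
    print("NOX coefficient, B removed: ",
          reduced.coef_[nox_idx if nox_idx < b_idx else nox_idx - 1])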
The term "white flight" refers to affluent white families moving to
the suburbs. And clearly a question is whether/how much of that was
racism versus avoiding air pollution.
On 6 Jul 2017 6:10 pm, "G Reina" <gre...@eng.ucsd.edu> wrote:
I'd like to request that the "Boston Housing Prices" dataset in
sklearn (sklearn.datasets.load_boston) be replaced with the "Ames
Housing Prices" dataset
(https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am
willing to submit the code change if the developers agree.
The Boston dataset has the feature "Bk is the proportion of blacks
in town". It is an incredibly racist "feature" to include in any
dataset. I think it is beneath us as data scientists.
I submit that the Ames dataset is a viable alternative for learning
regression. The author has shown that the dataset is a more robust
replacement for Boston. Ames is a 2011 regression dataset on
housing prices with more than 5 times the number of training
examples and over 7 times as many features (none of which are
morally questionable).
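Until a proper loader exists, something like the following sketch
could stand in. Note the assumptions: it pulls the data from OpenML
via fetch_openml, and the dataset name "house_prices" is an assumed
identifier for the Ames data, not a confirmed one.

    # Sketch only: load the Ames data from OpenML and fit a baseline.
    # The name "house_prices" is an assumption; adjust if it differs.
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    ames = fetch_openml(name="house_prices", as_frame=True)
    X = ames.data.select_dtypes("number").fillna(0)  # numeric-only, for brevity
    y = ames.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = Ridge().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))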
I welcome the community's thoughts on the matter.
Thanks.
-Tony
Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn