Dear gnome-infrastructure masters,

Here is a very sincere request from a software engineering researcher at Peking University, regarding GNOME Bugzilla data.
It's a long story, so please allow me to take a little of your time.

First, the data I would like. For every bug, I want two pages with full information, in particular the email of whoever performed each action, e.g.:

https://bugzilla.gnome.org/show_bug.cgi?ctype=xml&id=325562
https://bugzilla.gnome.org/show_activity.cgi?id=325562

Second, the reason I want this data. To help you understand, I'll give some details. I'm investigating long-term contributors (LTCs: people who stay with a project for at least three years and are above the 10th percentile in bugs per year) in OSS projects, and in particular how their attitude and environment during their first month with the project affect their chance of becoming LTCs. The hope is to understand best practices and help design a better community architecture.

I use the issue workflow recorded in the bug tracking system to conduct this study. The first thing I need to do is locate the LTCs. I take the first time a contributor performed any activity in Bugzilla as their joining day, and the period between the joining day and the day of their last activity (before the day the data was retrieved) as the duration they have stayed with the project.

Based on this, I model people's attitude and environment through the issue workflow. For example, I found that a newcomer's pro-community attitude doubles her odds of becoming an LTC, where that attitude is represented by her first contribution being a comment on an existing issue instead of a bug report, or a report made through the Bugzilla interface instead of a crash-reporting tool. Having any of the issues she reported during the first month fixed has the same effect. Her micro-climate, represented by low attention or too rapid a response, and her macro-climate, represented by increased project popularity, reduce her odds. And so forth.

In all these calculations, locating real people from their logins, names, emails, and activities is extremely important.
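
As a rough illustration of the tenure definition described above (joining day = first recorded activity, duration = span up to the last activity before the retrieval date), here is a minimal Python sketch. The input shape (pairs of email and timestamp), the function names, and the threshold constant are assumptions made for illustration, not the scripts used in the study, and the bugs-per-year percentile criterion is left out.

```python
from datetime import datetime

# Assumed threshold taken from the definition in this message: three years.
LTC_MIN_YEARS = 3.0


def tenure_by_contributor(activities, retrieved_on):
    """Map each email to (joining_day, last_day, years_with_project).

    `activities` is an iterable of (email, datetime) pairs. This input
    shape is hypothetical, not the real Bugzilla extract format.
    Activities on or after `retrieved_on` are ignored.
    """
    spans = {}
    for email, when in activities:
        if when >= retrieved_on:
            continue
        first, last = spans.get(email, (when, when))
        spans[email] = (min(first, when), max(last, when))
    return {
        email: (first, last, (last - first).days / 365.25)
        for email, (first, last) in spans.items()
    }


def meets_tenure_criterion(years):
    """True if the contributor stayed at least LTC_MIN_YEARS years."""
    return years >= LTC_MIN_YEARS
```

This only works if each handle maps to exactly one person, which is why the full email address matters for the calculation.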
In general, I consider a person's email as the handle that distinguishes them from anybody else, and this is a relatively good approach (the activity page only has emails, and not everybody has a name on the information page). If people have multiple emails, I also have a way to deal with that, but it is out of scope here.

However, the GNOME Bugzilla extract we have is extremely difficult to work with, for the following reason. We did retrieve GNOME Bugzilla once, in Jan 2011. We understood that the retrieval might cause problems for GNOME Bugzilla, so we were very careful about it. The retrieval was done without logging in, so the data doesn't have emails for the performers. For example, each activity appears under jhs instead of [email protected], or bugzilla-gnome instead of [email protected].

The issue with this data is that too many people share the same first part of their email address. For example, for jhs:

u2n;jhs;5680;1;Johnny Haugen Sørgård=1:2010.17307692308:2010.17307692308;Johannes Schmid=4035:2004.36538461538:2011.01923076923

There are many consequences. For example, it's difficult for scripts to determine whether jhs is an LTC, so I had to drop people with multiple names from the calculation. This certainly hurts the soundness of the study, and I'm not sure to what extent the truth was lost.

In any case, I was wondering whether there is a way to get a Bugzilla extract for GNOME. I understand the data may be sensitive to some extent, but I believe this is for a good cause. Understanding GNOME's practices not only helps other projects, but also helps the GNOME community itself. I guarantee the data will be used for research purposes only. If necessary, I could sign any agreement that protects whatever private information you do not want exposed.

Sorry if this bothers you too much.

Thanks,
Minghui Zhou
http://sei.pku.edu.cn/~zhmh
_______________________________________________
gnome-infrastructure mailing list
[email protected]
https://mail.gnome.org/mailman/listinfo/gnome-infrastructure
