Re: Nutch codebase formatting

2023-10-29 Thread Sebastian Nagel

Hi Lewis,

>> whether we need a Nutch custom code style at all… why don’t we just use
>> some other existing style and then enforce it?

Enforcing: yes!

However, I would try hard to keep the changes to a reasonable minimum. For
example, if we change the indentation, almost every line of code is affected,
which makes

- "git annotate" mostly useless (or more difficult to use because you need to
  look back)
- merges of open PRs, custom patches or modifications in custom repositories
  quite painful until the formatting is synchronized.


>> * google Java format [1] which offers a GitHub action for easy integration
>> into our CI process, or

+1

+ also available for IntelliJ and Eclipse
+ indentation stays the same
+/- about 25% of the code lines are changed (might be acceptable)


>> * superlinter [3] basically emerging as the industry OSS default, offers a
>> GitHub action and could also be configured to lint dockerfile, and other
>> artifacts. It can also be configured to use the google Java style as well…

+1 (with Google Java style)


> I’ll submit a PR for superlinter so everyone can see what it would look like.

Great! Thanks!


Best,
Sebastian

On 10/29/23 00:38, Lewis John McGibbney wrote:

Any thoughts on this, folks?
I’ll submit a PR for superlinter so everyone can see what it would look like.
lewismc

On 2023/10/23 19:28:45 lewis john mcgibbney wrote:

Hi dev@,

For the longest time the Nutch codebase has shipped with an
eclipse-codeformat.xml [0] file.
Whilst this has been largely successful in keeping the codebase uniform, it
has not been integrated into continuous integration (CI) and
consequently is not really enforced!

Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
should have some CI code formatting checks. Additionally I really question
whether we need a Nutch custom code style at all… why don’t we just use
some other existing style and then enforce it?

I therefore propose that we replace the legacy code formatter with a
convention such as

* google Java format [1] which offers a GitHub action for easy integration
into our CI process, or
* Checkstyle [2], which offers an Ant task we could use; this is of
less utility as we think about the move to Gradle
* superlinter [3] basically emerging as the industry OSS default, offers a
GitHub action and could also be configured to lint dockerfile, and other
artifacts. It can also be configured to use the google Java style as well…

My preference would be [3] because it offers a more comprehensive linting
package for the entire codebase not just the Java code.

Thanks for your consideration.
lewismc

[0]
https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
[1]
https://github.com/google/google-java-format
[2]
https://checkstyle.sourceforge.io/
[3]
https://github.com/marketplace/actions/super-linter



[jira] [Commented] (NUTCH-3014) Standardize Job names

2023-10-29 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780720#comment-17780720 ]

ASF GitHub Bot commented on NUTCH-3014:
---

sebastian-nagel commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979


##
src/java/org/apache/nutch/crawl/CrawlDbReader.java:
##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
 
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-    Job job = NutchJob.getInstance(getConf());
+    Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
   `this` isn't really required here.





> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}*: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
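
The convention proposed above ("Nutch", then the class name, then an optional
suffix) could be captured in one small helper so that every job builds its
name the same way. The following is only a sketch: the class JobNames and its
method of(...) are hypothetical names chosen for illustration and are not part
of Nutch or of PR #789.

```java
/**
 * Hypothetical helper (not part of Nutch) that builds job names following
 * the proposed convention: "Nutch ${ClassName}: ${additional info}".
 */
public final class JobNames {

  /** Static prefix that marks the job as a Nutch job. */
  private static final String PREFIX = "Nutch";

  private JobNames() {
    // utility class, no instances
  }

  /**
   * @param jobClass       class the job is defined in (mandatory)
   * @param additionalInfo optional detail, e.g. "process " + crawlDb;
   *                       null or blank means no suffix is added
   * @return the standardized job name
   */
  public static String of(Class<?> jobClass, String additionalInfo) {
    StringBuilder name = new StringBuilder(PREFIX)
        .append(' ')
        .append(jobClass.getSimpleName());
    if (additionalInfo != null && !additionalInfo.trim().isEmpty()) {
      name.append(": ").append(additionalInfo.trim());
    }
    return name.toString();
  }
}
```

Under that assumption, the call site shown in the review comment could read
Job.getInstance(getConf(), JobNames.of(CrawlDbReader.class, "process " + crawlDb)),
which keeps the prefix and separator in a single place.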





Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-10-29 Thread via GitHub


sebastian-nagel commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979


##
src/java/org/apache/nutch/crawl/CrawlDbReader.java:
##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
 
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-    Job job = NutchJob.getInstance(getConf());
+    Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
   `this` isn't really required here.


